Automatic spell checking of words in a text - language-agnostic

[EDIT]In Short: How would you write an automatic spell checker? The idea is that the checker builds a list of words from a known good source (a dictionary) and automatically adds new words when they are used often enough. Words which haven't been used in a while should be phased out. So if I delete part of a scene which contains "Mungrohyperiofier", the checker should remember it for a while and when I type "Mung<Ctrl+Space>" in another scene, it should offer it again. If I don't use the word for, say, a few days, it should forget about it.
At the same time, I'd like to avoid adding typos to the dictionary.[/EDIT]
I want to write a text editor for SciFi stories. The editor should offer word completion for any word used anywhere in the current story. It will only offer a single scene of the story for editing (so you can easily move scenes around).
This means I have three sets:
The set of all words in all other scenes
The set of words in the current scene before I started editing it
The set of words in the current editor
I need to store the sets somewhere as it would be too expensive to build the list from scratch every time. I think a simple plain text file with one-word-per-line is enough for that.
As the user edits the scene, we have these situations:
She deletes a word. This word is not used anywhere else in the current scene.
She types a word which is new
She types a word which already exists
She types a word which already exists but makes a typo
She corrects a typo in a word which is in set #2.
She corrects a typo in a word which is in set #1 (i.e. the typo is elsewhere, too).
She deletes a word which she plans to use again. After the deletion, the word is no longer in the sets #1 and #3, though.
The obvious strategy would be to rebuild the word sets when a scene is saved and build set #1 from a word-list file per scene.
So my question is: Is there a clever strategy to keep words which aren't used anywhere anymore but still be able to phase out typos? If possible, this strategy should work in the background without the user even noticing what is going on (i.e. I want to avoid having to grab the mouse to select "add word to dictionary" from the menu).
[EDIT] Based on a comment from grieve

So you want to write a spelling checker. Here's Peter Norvig's paper about writing a spelling corrector. It describes a simple and robust spelling corrector. You can use the already-written part of the book, plus a reference list (say from a free dictionary) for the language model.
I would also go to existing open-source spelling checkers, such as aspell and hunspell, to get some ideas.
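For flavor, the core of Norvig's approach is to generate every string within one edit of a typed word and rank the surviving candidates by how often they occur in the known-good text. The candidate generation, adapted from the idea in his article, looks something like this:

def edits1(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    # all strings one edit (delete, transpose, replace, insert) away from `word`
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

Candidates that actually appear in the word list are then ranked by usage frequency; everything else is discarded.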

The structure you should use is a trie. Tail/suffix compression will help with memory. You can use a pseudo reference counting GC for keeping track of usage.
For the actual nodes, you would probably need no more than a 32-bit integer: 21 bits for Unicode, and the rest for various other tags and information.
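As an illustration, here is a minimal sketch of such a trie in Python (my own, not from the answer; a plain per-word counter stands in for the pseudo reference counting, and no bit-packing is done):

class TrieNode:
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.count = 0       # times this exact word was used (0 = interior node)

class WordTrie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1      # bump the usage count on every use

    def complete(self, prefix):
        # yield every known word starting with `prefix`, for Ctrl+Space
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return
            node = node.children[ch]
        stack = [(node, prefix)]
        while stack:
            node, word = stack.pop()
            if node.count > 0:
                yield word
            for ch, child in node.children.items():
                stack.append((child, word + ch))

Words whose counts have decayed to zero can be pruned from the trie in a background pass.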

Reminds me of what I have been told about garbage collection in modern LISP implementations:
data, when created, is put in "pool 1";
when there is a need to garbage collect, the collector looks in pool 1 for unused entries and removes them.
Then any remaining entries are moved to pool 2.
Pool 2 is examined only when there is a need for more memory than pool 1 can release.
Data from pool 2 that survives a garbage collection is put in pool 3, and so on.
The idea is to dynamically put the data in a pool corresponding to its lifetime...
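Applied to the word list, that might look like the following sketch (my own adaptation, not from the answer; the pool lifetimes are arbitrary): words start in pool 0 when used, are demoted one pool at a time while unused, and are forgotten after the last pool.

import time

POOL_LIFETIME_DAYS = [1, 3, 7]   # hypothetical lifetime per pool

class WordPools:
    def __init__(self):
        self.words = {}   # word -> [pool index, last-used timestamp]

    def use(self, word):
        # any use promotes the word back to the youngest pool
        self.words[word] = [0, time.time()]

    def collect(self, now=None):
        # run occasionally in the background, e.g. on save
        now = now or time.time()
        for word, entry in list(self.words.items()):
            pool, last_used = entry
            if now - last_used > POOL_LIFETIME_DAYS[pool] * 86400:
                if pool + 1 >= len(POOL_LIFETIME_DAYS):
                    del self.words[word]   # phased out for good
                else:
                    entry[0] = pool + 1    # demote to an older pool

Words from the known-good dictionary would simply never enter the pools, so they are never phased out, while one-off typos age out of pool 0 quickly.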

Related

How to restrict flushing of specific objects using before_flush()

I have three tables: fits, character, and skills. Each Character has a list of Skills, of which the skill level can be changed. Each Fit belongs to a Character, and a Character may own any number of fits.
I would like the user to be able to change the levels of the skills belonging to a Character, but on a temporary basis. That is, when they change it, they can close the program and lose the changes, or choose to save the changes they have made. This seems difficult to do for three reasons:
Any time the Character object (which has a list of Skill objects) changes, it is added as a dirty object to the session. This leads into the next problem...
Any time other things in the program change (the Fit object is modified, the user creates new Characters, etc), the program does a flush/commit, which would include the temporary changes that were made to previous characters.
I do not wish to expunge() the character from the session, because when a new fit is loaded, it will load a new Character object fresh from the database. This is unwanted as I wish to use the modified Character for all fits that are assigned to that character, and I do not want extra Character objects floating around.
Basically, I want the user to tweak the Character object without fear of having it saved with a session flush/commit from another change.
I have thought about using SQLAlchemy's before_flush event, but examples of using it seem to be sparse. I envision setting a property on the Character whenever it is changed, and then checking for this property before flushing. If it has this property, remove it from the flush pool and flush/commit all other changes. I think this would be exactly what I need, however I'm not sure how to work with the events.
There are many ways to do this. One possible solution is to use a transient (non-mapped) attribute to store the temporary skill level. Create a reconstructor:

class Skill(Base):
    # ... mapped columns, including `skill` ...

    @orm.reconstructor
    def init_on_load(self):
        # working copy the user can tweak without touching the mapped column
        self.current_skill = self.skill

    def save_current_skill(self):
        # copy the working value back so it is persisted on the next flush
        self.skill = self.current_skill

Don't forget to override the constructor to create the field as well, since the reconstructor only runs for instances loaded from the database.
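The before_flush event mentioned in the question can also work. Here is a minimal sketch (the Character class and the _transient_edit flag are assumptions; the application would set the flag whenever the user tweaks skills temporarily):

from sqlalchemy import event, inspect
from sqlalchemy.orm import Session

@event.listens_for(Session, "before_flush")
def skip_transient_characters(session, flush_context, instances):
    for obj in list(session.dirty):
        if isinstance(obj, Character) and getattr(obj, "_transient_edit", False):
            # stash the in-memory values under a non-mapped key so they
            # survive the expire() below
            obj.__dict__["_stashed"] = {a.key: a.value for a in inspect(obj).attrs}
            # expire() discards the pending changes, so this object drops
            # out of the flush; the stash can be reapplied afterwards
            session.expire(obj)

Listening on the Session class registers the hook for all sessions; pass a specific session or sessionmaker instead to scope it more narrowly.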

What is differential execution?

I stumbled upon a Stack Overflow question, How does differential execution work?, which has a VERY long and detailed answer. All of it made sense... but when I was done I still had no idea what the heck differential execution actually is. What is it really?
REVISED. This is my Nth attempt to explain it.
Suppose you have a simple deterministic procedure that executes repeatedly, always following the same sequence of statement executions or procedure calls.
The procedure calls themselves write anything they want sequentially to a FIFO, and they read the same number of bytes from the other end of the FIFO, like this:**
The procedures being called are using the FIFO as memory, because what they read is the same as what they wrote on the prior execution.
So if their arguments happen to be different this time from last time, they can see that, and do anything they want with that information.
To get it started, there has to be an initial execution in which only writing happens, no reading.
Symmetrically, there should be a final execution in which only reading happens, no writing.
So there is a "global" mode register containing two bits, one that enables reading and one that enables writing, like this:
The initial execution is done in mode 01, so only writing is done.
The procedure calls can see the mode, so they know there is no prior history.
If they want to create objects they can, and put the identifying information in the FIFO (no need to store in variables).
The intermediate executions are done in mode 11, so both reading and writing happen, and the procedure calls can detect data changes.
If there are objects to be kept up to date,
their identifying information is read from and written to the FIFO,
so they can be accessed and, if necessary, modified.
The final execution is done in mode 10, so only reading happens.
In that mode, the procedure calls know they are just cleaning up.
If there were any objects being maintained, their identifiers are read from the FIFO, and they can be deleted.
But real procedures do not always follow the same sequence.
They contain IF statements (and other ways of varying what they do).
How can that be handled?
The answer is a special kind of IF statement (and its terminating ENDIF statement).
Here's how it works.
It writes the boolean value of its test expression, and it reads the value that the test expression had last time.
That way, it can tell if the test expression has changed, and take action.
The action it takes is to temporarily alter the mode register.
Specifically, x is the prior value of the test expression, read from the FIFO (if reading is enabled, else 0), and y is the current value of the test expression, written to the FIFO (if writing is enabled).
(Actually, if writing is not enabled, the test expression is not even evaluated, and y is 0.)
Then x,y simply masks the mode register r,w: within the IF body, reading is enabled only if both r and x are 1, and writing only if both w and y are 1.
So if the test expression has changed from True to False, the body is executed in read-only mode. Conversely if it has changed from False to True, the body is executed in write-only mode.
If the result is 00, the code inside the IF..ENDIF statement is skipped.
(You might want to think a bit about whether this covers all cases - it does.)
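To make the mechanism concrete, here is a minimal sketch in Python (my own reading of the description; the DiffExec name, the callable-based if_, and the mode strings are all mine, not the author's):

from collections import deque

class DiffExec:
    def __init__(self):
        self.fifo = deque()   # holds exactly what was written on the prior pass
        self.r = False        # reading enabled?
        self.w = False        # writing enabled?

    def step(self, value):
        # write `value` (if writing) and return what was written at this
        # same point on the prior pass (if reading), else None
        prior = self.fifo.popleft() if self.r else None
        if self.w:
            self.fifo.append(value)
        return prior

    def if_(self, test, body):
        # the special IF: `test` and `body` are zero-argument callables;
        # per the description, `test` is only evaluated when writing is enabled
        x = bool(self.fifo.popleft()) if self.r else False   # prior test value
        y = bool(test()) if self.w else False                # current test value
        if self.w:
            self.fifo.append(y)
        saved = (self.r, self.w)
        self.r, self.w = self.r and x, self.w and y          # mask the mode
        try:
            if self.r or self.w:   # mode 00 means the body is skipped
                body()
        finally:
            self.r, self.w = saved

    def run(self, proc, mode):
        # mode: "show" = 01 (write only), "update" = 11, "erase" = 10 (read only)
        self.r = mode in ("update", "erase")
        self.w = mode in ("show", "update")
        proc(self)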
It may not be obvious, but these IF..ENDIF statements can be arbitrarily nested, and they can be extended to all other kinds of conditional statements like ELSE, SWITCH, WHILE, FOR, and even calling pointer-based functions. It is also the case that the procedure can be divided into sub-procedures to any extent desired, including recursive, as long as the mode is obeyed.
(There is a rule that must be followed, called the erase-mode rule, which is that in mode 10 no computation of any consequence, such as following a pointer or indexing an array, should be done. Conceptually, the reason is that mode 10 exists only for the purpose of getting rid of stuff.)
So it is an interesting control structure that can be exploited to detect changes, typically data changes, and take action on those changes.
Its use in graphical user interfaces is to keep some set of controls or other objects in agreement with program state information. For that use, the three modes are called SHOW(01), UPDATE(11), and ERASE(10).
The procedure is initially executed in SHOW mode, in which controls are created, and information relevant to them populates the FIFO.
Then any number of executions are made in UPDATE mode, where the controls are modified as necessary to stay up to date with program state.
Finally, there is an execution in ERASE mode, in which the controls are removed from the UI, and the FIFO is emptied.
The benefit of doing this is that, once you've written the procedure to create all the controls, as a function of the program's state, you don't have to write anything else to keep it updated or clean up afterward.
Anything you don't have to write means less opportunity to make mistakes.
(There is a straightforward way to handle user input events without having to write event handlers and create names for them. This is explained in one of the videos linked below.)
In terms of memory management, you don't have to make up variable names or data structures to hold the controls. It only uses enough storage at any one time for the currently visible controls, while the potentially visible controls can be unlimited. Also, there is never any concern about garbage collection of previously used controls - the FIFO acts as an automatic garbage collector.
In terms of performance, when it is creating, deleting, or modifying controls, it is spending time that needs to be spent anyway.
When it is simply updating controls, and there is no change, the cycles needed to do the reading, writing, and comparison are microscopic compared to altering controls.
Another performance and correctness consideration, relative to systems that update displays in response to events, is that such a system requires that every event be responded to, and none twice, otherwise the display will be incorrect, even though some event sequences may be self-canceling. Under differential execution, update passes may be performed as often or as seldom as desired, and the display is always correct at the end of a pass.
Here is an extremely abbreviated example where there are 4 buttons, of which buttons 2 and 3 are conditional on a boolean variable.
In the first pass, in Show mode, the boolean is false, so only buttons 1 and 4 appear.
Then the boolean is set to true and pass 2 is performed in Update mode, in which buttons 2 and 3 are instantiated and button 4 is moved, giving the same result as if the boolean had been true on the first pass.
Then the boolean is set false and pass 3 is performed in Update mode, causing buttons 2 and 3 to be removed and button 4 to move back up to where it was before.
Finally pass 4 is done in Erase mode, causing everything to disappear.
(In this example, the changes are undone in the reverse order as they were done, but that is not necessary. Changes can be made and unmade in any order.)
Note that, at all times, the FIFO, consisting of Old and New concatenated together, contains exactly the parameters of the visible buttons plus the boolean value.
The point of this is to show how a single "paint" procedure can also be used, without change, for arbitrary automatic incremental updating and erasing.
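For concreteness, here is how that four-button example might look on top of the DiffExec sketch from earlier (again my own names; the position parameters that would make button 4 actually "move" are omitted, and prints stand in for real UI calls):

show_extras = False

def draw_button(df, label):
    prior = df.step(label)
    if df.w and prior is None:
        print("create", label)               # no prior history: make the control
    elif df.w and df.r and prior != label:
        print("modify", prior, "->", label)  # changed since the last pass
    elif df.r and not df.w:
        print("erase", prior)                # erase mode: clean up

def extras(df):
    draw_button(df, "Button 2")
    draw_button(df, "Button 3")

def paint(df):
    draw_button(df, "Button 1")
    df.if_(lambda: show_extras, lambda: extras(df))
    draw_button(df, "Button 4")

df = DiffExec()
df.run(paint, "show")     # pass 1: creates buttons 1 and 4
show_extras = True
df.run(paint, "update")   # pass 2: creates buttons 2 and 3
show_extras = False
df.run(paint, "update")   # pass 3: erases buttons 2 and 3
df.run(paint, "erase")    # pass 4: erases everything; the FIFO ends empty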
I hope it is clear that it works for arbitrary depth of sub-procedure calls, and arbitrary nesting of conditionals, including switch, while and for loops, calling pointer-based functions, etc.
If I have to explain that, then I'm open to potshots for making the explanation too complicated.
Finally, there are a couple of crude but short videos posted here.
** Technically, they have to read the same number of bytes they wrote last time. So, for example, they might have written a string preceded by a character count, and that's OK.
ADDED: It took me a long time to be sure this would always work.
I finally proved it.
It is based on a Sync property, roughly meaning that at any point in the program the number of bytes written on the prior pass equals the number read on the subsequent pass.
The idea behind the proof is to do it by induction on program length.
The toughest case to prove is the case of a section of program consisting of s1 followed by an IF(test) s2 ENDIF, where s1 and s2 are subsections of the program, each satisfying the Sync property.
To do it in text-only is eye-glazing, but I've tried to diagram it: the diagram defines the Sync property, shows the number of bytes written and read at each point in the code, and shows that they are equal.
The key points are that 1) the value of the test expression (0 or 1) read on the current pass must equal the value written on the prior pass, and 2) the condition of Sync(s2) is satisfied.
This satisfies the Sync property for the combined program.
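In symbols (my notation, not the author's): Sync(s) holds iff for every pass p, W_p(s) = R_{p+1}(s), where W_p(s) is the number of bytes section s writes on pass p and R_{p+1}(s) is the number it reads on the following pass.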
I read all the stuff I could find and watched the videos, and will take a shot at a first-principles description.
Overview
This is a DSL-based design pattern for implementing user interfaces and perhaps other state-oriented subsystems in a clean, efficient manner. It focuses on the problem of changing the GUI configuration to match current program state, where that state includes the condition of GUI widgets themselves, e.g. the user selects tabs, radio buttons, and menu items, and widgets appear/disappear in arbitrarily complex ways.
Description
The pattern assumes:
A global collection C of objects that needs periodic updates.
A family of types for those objects, where instances have parameters.
A set of operations on C:
Add A P - Put a new object A into C with parameters P.
Modify A P - Change the parameters of object A in C to P.
Delete A - Remove object A from C.
An update of C consists of a sequence of such operations to transform C to a given target collection, say C'.
Given current collection C and target C', the goal is to find the update with minimum cost. Each operation has unit cost.
The set of possible collections is described in a domain-specific language (DSL) that has the following commands:
Create A H - Instantiate some object A, using optional hints H, and add it to the global state. (Note no parameters here.)
If B Then T Else F - Conditionally execute command sequence T or F based on Boolean function B, which can depend on anything in the running program.
In all the examples,
The global state is a GUI screen or window.
The objects are UI widgets. Types are button, dropdown box, text field, ...
Parameters control widget appearance and behavior.
Each update consists of adding, deleting, and modifying (e.g. relocating) any number of widgets in the GUI.
The Create commands are making widgets: buttons, dropdown boxes, ...
The Boolean functions depend on the underlying program state including the condition of GUI controls themselves. So changing a control can affect the screen.
Missing links
The inventor never explicitly states it, but a key idea is that we run the DSL interpreter over the program that represents all possible target collections (screens) every time we expect any combination of the Boolean function values B has changed. The interpreter handles the dirty work of making the collection (screen) consistent with the new B values by emitting a sequence of Add, Delete, and Modify operations.
There is a final hidden assumption: The DSL interpreter includes some algorithm that can provide the parameters for the Add and Modify operations based on the history of Creates executed so far during its current run. In the GUI context, this is the layout algorithm, and the Create hints are layout hints.
Punch line
The power of the technique lies in the way complexity is encapsulated in the DSL interpreter. A stupid interpreter would start by Deleting all the objects (widgets) in the collection (screen), then Add a new one for each Create command as it sees them while stepping through the DSL program. Modify would never occur.
Differential execution is just a smarter strategy for the interpreter. It amounts to keeping a serialized recording of the interpreter's last execution. This makes sense because the recording captures what's currently on the screen. During the current run, the interpreter consults the recording to make decisions about how to bring about the target collection (widget configuration) with operations having least cost. This boils down to never Deleting an object (widget) only to Add it again later for a cost of 2. DE will always Modify instead, which has a cost of 1. If we happen to run the interpreter in some case where the B values have not changed, the DE algorithm will generate no operations at all: the recorded stream already represents the target.
As the interpreter executes commands, it is also setting up the recording for its next run.
An analogous algorithm
The algorithm has the same flavor as minimum edit distance (MED). However DE is a simpler problem than MED because there are no "repeated characters" in the DE serialized execution strings as there are in MED. This means we can find an optimal solution with a straightforward on-line greedy algorithm rather than dynamic programming. That's what the inventor's algorithm does.
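To see why greedy suffices when identities are unique, consider this non-streaming approximation (my own illustration, not the inventor's on-line algorithm): each recorded object either reappears, changes, or disappears, and each case can be decided locally without ever pairing a Delete with a later Add.

def minimal_update_ops(old, new):
    # old/new: lists of (identity, params) with unique identities, standing
    # in for the prior and current serialized executions
    old_params = dict(old)
    new_ids = {ident for ident, _ in new}
    ops = []
    for ident, params in new:
        if ident not in old_params:
            ops.append(("Add", ident, params))
        elif old_params[ident] != params:
            ops.append(("Modify", ident, params))   # never Delete + Add
    ops += [("Delete", ident) for ident, _ in old if ident not in new_ids]
    return ops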
Strengths
My take is that this is a good pattern for implementing systems with many complex forms where you want total control over placement of widgets with your own layout algorithm and/or the "if else" logic of what's visible is deeply nested. If there are K nests of "if elses" N deep in the form logic, then there are K*2^N different layouts to get right. Traditional form design systems (at least the ones I've used) don't support larger K, N values very well at all. You tend to end up with large numbers of similar layouts and ad hoc logic to select them that's ugly and hard to maintain. This DSL pattern seems a way to avoid all that. In systems with enough forms to offset the DSL interpreter's cost, it would even be cheaper during initial implementation. Separation of concerns is also a strength. The DSL programs abstract the content of forms while the interpreter is the layout strategy, acting on hints from the DSL. Getting the DSL and layout hint design right seems like a significant and cool problem in itself.
Questionable...
I'm not sure that avoiding Delete/Add pairs in favor of Modify is worth all the trouble in modern systems. The inventor seems most proud of this optimization, but the more important idea is a concise DSL with conditionals to represent forms, with layout complexity isolated in the DSL interpreter.
Recap
The inventor has so far focused on the deep details of how the interpreter makes its decisions. This is confusing because it's directed at the trees while the forest is of greater interest. This is a description of the forest.

Best usability practice for accepting long-ish account numbers

A user recently inquired (OK, complained) as to why a 19-digit account number on our web site was broken up into 4 individual text boxes of length [5,5,5,4]. Not being the original designer, I couldn't answer the question, but I'd always assumed that it was done in order to preserve data quality and possibly also to provide a better user experience.
Other more generic examples include Phone with Area Code (10 consecutive digits versus [3,3,4]) and of course SSN (9 digits versus [3,2,4])
It got me wondering whether there are any known standards out there on the topic? When do you split up your ID#? Specifically with regards to user experience and minimizing data entry errors.
I know there was some research into this, the most I can find at the moment is the Wikipedia article on Short-term memory, specifically chunking. There's also The Magical Number Seven, Plus or Minus Two.
When I'm providing IDs to end users I personally like to break them up into blocks of 5, which appears to be the same convention the original designer of your system used. I've got no logical reason that I can give you for having picked this number other than it "feels right". Short of being able to spend a lot of money on carrying out a study, "gut instinct" and following conventions from other systems is probably the way to go.
That said, if you can make the UI more usable to the user by:
Automatically moving from the end of one field to the start of another when it's complete
Automatically moving from the start of one field to the prior field and deleting the last character when the user presses delete in an empty field that isn't the first one
OR
Replacing it with one long field that has some form of "input mask" on it (not sure if this is doable in plain HTML, but it may be feasible using one of the UI frameworks) so it appears like "_____ - _____ - _____ - ____" and ends up looking like "12345 - 54321 - 12345 - 1234"
It would almost certainly make them happier!
Don't know about standards, but from a personal point of view:
If there are multiple fields, make sure the cursor moves to the next field once a field is full.
If there's only one field, allow spaces/dashes/whatever to be used in that field because you can filter them out. It's really annoying when sites/programs force you to enter dates in "dd/mm/yyyy" format, for example, meaning the day/month must be padded with zeroes. "23/8/2010" should be acceptable.
You need to consider the wider context of your particular application. There are always pros and cons of any design decision, but their impact changes depending on the situation, so you have to think every time.
Splitting the long number into several fields makes it easier to read, especially if you choose to divide the number the same way as most of your users. You can also often validate the input as soon as the user goes to the next field, so you indicate errors earlier.
On the other hand, users rarely type long numbers like that nowadays: most of the time they just copy-paste them from whatever note-keeping solution they have chosen, in whatever format they have it there. That means that a single field, without any limit on length or allowed characters, suddenly makes a lot of sense -- you can filter the characters out anyway (just make sure you display the final form of the number to the user at some point). There are also issues with moving the focus between fields, with browsers remembering previous values (you only have to select one number, not 4 parts of the same number), etc.
In general, I would say that as browsers slowly become more and more usable, you should take advantage of the mechanisms they provide by using the stock solutions, and not invent complex solutions of your own. You may be a step ahead of them today, but in two years the browsers will catch up and your site will suck.

Do you use particular conventions for naming complementary variables?

I often find myself trying to come up with good names for complementary pairs of variables; where two variables denote opposing concepts, two participants in some sort of duologue, and so on.
This might be better explained by a counter-example - I maintain an app that prints two graphics as part of a print advertisement. They're stored in the database as TopLogo and LowerLogo, which I have to stop and double-check every time I use them because I'm expecting top to complement bottom, and lower should complement upper.
There's some obvious examples that I think work well:
client / server
source / target for copying/moving data or files from one variable to another
minimum / maximum
but there's some concepts that just don't lend themselves to such neat naming schemes. For example, when paging through records, does 'last' mean 'final' or 'previous'? I recently saw some code that used firstPage, previousPage, nextPage and finalPage to avoid the ambiguous lastPage completely, which I thought was very neat, hence this question.
Do you have any particularly neat variable name pairs you'd care to share with us? (Bonus points if they're the same length, which makes the code so much neater in monospaced fonts.)
Like with all kinds of code style conventions, consistency is what you should strive for.
I would have the development team agree on "standard" pairs of prefixes for common scenarios like "source/destination" or "from/to" and then stick with them for the whole project. As long as every developer is aware of what is meant with a particular prefix in the codebase, it is easier to avoid misunderstandings.
Exceptions to the rule should be clarified in the documentation if the variable is part of a public API, or in comments within the code if its visibility is restricted to a single class or method.
In my databases you'll find many valid-state temporal ("history") tables containing a pair of columns named start_date and end_date. No bonus points for me, then, because I'd rather use the commonly used 'end' than try to come up with an intuitive alternative with the same number of characters as the word 'start'.
I tend to prefer these generic terms even when more context-specific terms may be viable, e.g. preferring employee_start_date over employee_hire_date (what if their employment started for a reason other than being formally hired, e.g. their company was the subject of an acquisition). That said, I'd prefer person_birth_date over person_start_date :)
While one does try to be semantically coherent in obvious cases -- e.g., maximum goes with minimum, and not "lowest" -- in well-structured OO code (which isn't all code, I know) the problem disappears with a good IDE. Classes are short, methods are short, and variables are few in each method. So it doesn't matter what you call the variable pairs so long as they're clear. Your code might not look professional, but real quality is in the code, not in the look of your code.
The problem further disappears if there is good JavaDoc or whatever the documentation system is, and if you have good class names that go with them. For instance, if you have an instance of a Connection class and it has a method called setDestination, that's okay, but if you know that setDestination takes one parameter called destination and it's of the Server class, you're cool... even though you might prefer to call it target, aimHere, placeToSendTheData, or whatever (and the corresponding names, source, comingFromHere, and placeToGetTheDataFrom). Plus the doc system says what the thing is for, and that is priceless.
This next thing might sound stupid and I'm sure I'll get voted down here on StackOverflow, but unique non-professional-sounding variable names have a great advantage: I know that my variables have names like placeWeWantTheDataToGo (and the IDE takes care of typing it), but the "serious" guys who do the JDK would never use such silly names. So I know immediately that the variable is one of mine. Incidentally, when I worked with developers in Spain and Italy, they wrote code with Spanish variable names (not always, but usually). This has the same effect: we can quickly see that the Conexion class is ours, but the Connection class is not.
[Also, instead of typing your variable names, assign them a constant String somewhere in your code and use that, so if they called it lower or downer instead of low, you're still okay.]
Yes, I do try to name complementary sets of variables systematically so that the symmetry is clear. It is not always easy; sometimes, not even possible. Well, not possible using the rules I lay down for myself - which means I usually try to have the names the same length. The 'top' and 'lower' example would drive me batty (assuming I'm not batty already, which is far from certain); I'd probably use 'upper' and 'lower' because those are the same length; 'top' and 'bottom' would frustrate me too because of the difference in length.

What are important points when designing a (binary) file format? [closed]

When designing a file format for recording binary data, what attributes would you think the format should have? So far, I've come up with the following important points:
have some "magic bytes" at the beginning, to be able to recognize the files (in my specific case, this should also help to distinguish the files from "legacy" files)
have a file version number at the beginning, so that the file format can be changed later without breaking compatibility
specify the endianness and size of all data items; or: include some space to describe endianness/size of data (I would tend towards the former)
possibly reserve some space for further per-file attributes that might be necessary in the future?
What else would be useful to make the format more future-proof and minimize headache in the future?
Take a look at the PNG spec. This format has some very good rationale behind it.
Also, decide what's important for your future format: compactness, compatibility, the ability to embed other formats (different compression algorithms) inside it. Another interesting example would be Google's protocol buffers, where the size of the transferred data is king.
As for endianness, I'd suggest you pick one option and stick with it, not allowing different byte orders. Otherwise, reading and writing libraries will only get more complex and slower.
I agree that these are good ideas:
Magic numbers at the beginning. Pretty much required in *nix.
File version number for backwards compatibility.
Endianness specification.
But your fourth one ("possibly reserve some space for further per-file attributes that might be necessary in the future?") is overkill, because #2 lets you add fields as long as you change the version number (and as long as you don't need forward compatibility).
Also, the idea of imposing a block-structure on your file, expressed in many other answers, seems less like a universal requirement for binary files than a solution to a problem with certain kinds of payloads.
In addition to 1-3 above, I'd add these:
simple checksum or other way of detecting that the contents are intact. Otherwise you can't trust magic bytes or version numbers. Be careful to spec which bytes are included in the checksum. Typically you would include all bytes in the file that don't already have error detection.
version of your software (including the most granular number you have, e.g. build number) that wrote the file. You're going to get a bug report with an attached file from someone who can't open it and they will have no clue when they wrote the file because the error didn't occur then. But the bug is in the version that wrote it, not in the one trying to read it.
Make it clear in the spec that this is a binary format, i.e. all values 0-255 are allowed for all bytes (except the magic numbers).
And here are some optional ones:
If you do need forward compatibility, you need some way of expressing which "chunks" are "optional" (like png does), so that a previous version of your software can skip over them gracefully.
If you expect these files to be found "in the wild", you might consider embedding some clue to find the spec. Imagine how helpful it would be to find the string http://www.w3.org/TR/PNG/ in a png file.
It all depends on the purpose of the format, of course.
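To make points 1-3 plus the checksum and writer-version suggestions concrete, a header along these lines might do (a sketch; the magic bytes and field widths are arbitrary choices, not a standard):

import struct
import zlib

MAGIC = b"MYF1"      # hypothetical magic bytes
HEADER = ">4sHII"    # magic, format version, writer build number, CRC32 of payload

def write_file(f, format_version, writer_build, payload):
    f.write(struct.pack(HEADER, MAGIC, format_version, writer_build,
                        zlib.crc32(payload)))
    f.write(payload)

def read_file(f):
    magic, version, build, crc = struct.unpack(HEADER,
                                               f.read(struct.calcsize(HEADER)))
    if magic != MAGIC:
        raise ValueError("not one of our files")
    payload = f.read()
    if zlib.crc32(payload) != crc:
        raise ValueError("corrupt file (checksum mismatch)")
    return version, build, payload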
One flexible approach is to structure entire file as TLV (Tag-Length-Value) triplets.
For example, make your file a sequence of records, each beginning with a 4-byte header:
1 byte = record type
3 bytes = record length
followed by record content
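A sketch of reading and writing such records (assuming big-endian lengths; unknown record types are simply skipped, which is what makes the format extensible):

def write_record(f, rtype, payload):
    # 1-byte record type + 3-byte length, then the content
    if len(payload) >= 1 << 24:
        raise ValueError("record too long for a 3-byte length")
    f.write(bytes([rtype]) + len(payload).to_bytes(3, "big"))
    f.write(payload)

def read_records(f, known_types):
    while True:
        header = f.read(4)
        if len(header) < 4:
            return   # end of file
        rtype = header[0]
        length = int.from_bytes(header[1:4], "big")
        payload = f.read(length)
        if rtype in known_types:
            yield rtype, payload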
Regarding the endianness, if you store an endianness indicator in the file, all your applications will have to support all endianness formats. On the other hand, if you specify a particular endianness for your files, only applications on platforms with non-matching endianness will have to do additional work, and it can be decided at compile time (using conditional compilation).
Another point, taken from the .xz file spec (http://tukaani.org/xz/xz-file-format.txt): one of the first few bytes should be a non-character, "to prevent applications from misdetecting the file as a text file". Not sure how many header bytes are usually inspected by editors and other tools, but placing such a byte among the first four or eight bytes seems useful.
One of the most important things to know before even starting is how your file will be used.
Will random or sequential access be the norm?
How often will the data be read?
How often will the data be written?
Will you write out the file in one go, or will you be slowly writing it as data comes in?
Will the file need to be portable? Not all formats need to be.
Does it need to be compatible with other versions? Maybe updating the file is sufficient.
Does it need to be easy to read/write?
Size/Speed/Complexity tradeoff.
Most answers here give good advice on the portability/compatibility front so I am not going to add more. But consider the following (often overlooked) things.
Some files are often written and rarely read (backups, logs, ...) and you may want to focus on filesize and easy-writing.
Converting endianness is (relatively) slow. If your file will never leave the host, or leaves it rarely enough that conversion is an acceptable cost, you can get a significant performance boost by writing in native byte order. Consider writing a number such as 0x1234 as part of the header so that you can detect this case (and instruct the user to convert).
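For instance (a sketch, using the 0x1234 marker suggested above):

import struct

def needs_byte_swap(marker_bytes):
    # the writer stored 0x1234 in its native byte order; read it in ours
    (value,) = struct.unpack("=H", marker_bytes)
    if value == 0x1234:
        return False   # writer and reader agree
    if value == 0x3412:
        return True    # writer had the opposite byte order
    raise ValueError("unrecognized byte-order marker")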
Sometimes easy reading is really useful. If you are doing logs or text documents, consider compressing all in one go rather than per-entry so that you can zcat | strings the file and see what is inside.
There are many things to keep in mind and designing a good format takes a lot of planning and foresight. The little things such as zcating a file and getting useful information or the small performance boost from using native integers can give your product an edge, however you need to be careful that you don't sacrifice something important to get it.
One way to future proof the file would be to provide for blocks. Straight after your file header data, you can begin the first block. The block could have a byte or word code for the type of block, then a size in bytes. Now you can arbitrarily add new block types, and you can skip to the end of a block.
I would consider defining a substructure that higher levels use to store data, a little like a mini file system inside the file.
For example, even though your file format is going to store application-specific data, I would consider defining records / streams etc. inside the file in such a way that application-agnostic code is able to understand the layout of the file, but not of course understand the opaque payloads.
Let's get a little more concrete. Consider the usual ways of storing data in memory: generally they can be boiled down to contiguous expandable arrays/lists, pointer/reference-based graphs, and binary blobs of data in particular formats.
Thus, it may be fruitful to define the binary file format along similar lines. Use record headers which indicate the length and composition of the following data, whether it's in the form of an array (a list of identically-typed records), references (offsets to other records in the file), or data blobs (e.g. string data in a particular encoding, but not containing any references).
If carefully designed, this can permit the file format to be used not just for persisting data in and out all in one go, but on an incremental, as-needed basis. If the substructure is properly designed, it can be application agnostic yet still permit e.g. a garbage collection application to be written, which understands the blobs, arrays and reference record types, and is able to trace through the file and eliminate unused records (i.e. records that are no longer pointed to).
That's just one idea. Other places to look for ideas are in general file system designs, or relational database physical storage strategies.
Of course, depending on your requirements, this may be overkill. You may simply be after a binary format for persisting in-memory data, in which case an approach to consider is tagged records.
In this approach, every piece of data is prefixed with a tag. The tag indicates the type of the immediately following data, and possibly its length and name. Lists may be suffixed with an "end-list" tag that has no payload. The tag may have an embedded identifier, so tags that aren't understood can be ignored by the serialization mechanism when it's reading things in. It's a bit like XML in this respect, except using binary idioms instead.
Actually, XML is a good place to look for long-term longevity of a file format. Look at its namespacing capabilities. If you construct your reading and writing code carefully, it ought to be possible to write applications that preserve the location and content of tagged (recursively) data they don't understand, possibly because it's been written by a later version of the same application.
Make sure that you reserve a tag code (or better yet reserve a bit in each tag) that specifies a deleted/free block/chunk.
Blocks can then be deleted by simply changing a block's current tag code to the deleted tag code, or by setting the tag's deleted bit.
This way you don't need to completely restructure your file right away when you delete a block.
Reserving a bit in the tag provides the option of undeleting the block later
(if you leave the block's data unchanged).
For security, however, you might want to zero out the deleted block's data; in this case you would use a special deleted/free tag.
I agree with Stepan that you should choose an endianness, but I would also have an endianness indicator in the file.
If you use an endianness indicator, you might consider using one of the Unicode Byte Order Marks, which doubles as an indicator of the Unicode text encoding used for any text blocks. The BOM is usually the first few bytes of Unicode text files, so if your BOM is the first entry in your file there is a chance that some utility will misidentify your file as Unicode text (I don't think this is much of an issue).
I would treat/reserve the BOM as one of your normal tags (using either the UTF-16 BOM if using 16-bit tags or the UTF-32 BOM if using 32-bit tags) with a zero-length block/chunk.
See also http://en.wikipedia.org/wiki/File_format
I agree with atzz's suggestion of using a Tag Length Value system. For future compatibility, you could store a set of "pointers" to TLV entries at the start (or maybe Tag,Pointer and have the pointer point to a Length,Value; or perhaps Tag,Length,Pointer and then have all the data together elsewhere?).
So, my file could look something like:
magic number/file id
version
tag for first data entry
pointer to first data entry --------+
tag for second data entry |
pointer to second data entry |
... |
length of first data entry <--------+
value for first data entry
...
magic number, version, tags, pointers and lengths would all be a predefined set length, for easy decoding. Say, 2 bytes. Or 4, depending on what you need. They don't all need to be the same (eg, all tags are 1 byte, pointers are 4 etc).
The tag lets you know what is being stored. The pointer tells you where (either an offset or absolute value, in bytes), the length tells you how large the data is, and the value is length bytes of data of type tag.
If you use a MyFileFormat v1 decoder on a MyFileFormat v2 file, the pointers allow you to skip sections which the v1 decoder doesn't understand. If you simply skip invalid tags, you can probably simply use TLV instead of TPLV.
I would either hand code something like that, or maybe define my format in ASN.1 and generate a codec (I work in telecommunications, so ASN.1/TLV makes sense to me :-D)
If you're dealing with variable-length data, it's much more efficient to use pointers: Have an array of pointers to your data, ideally near the start of the file, rather than storing the data in an array directly.
Indirection is preferable in this instance because it allows random access, which otherwise is only possible if all items are the same size. If the data was directly stored in an array, without specifying the locations of any records, data access would take O(n) time in the worst case; in order for your file-reading code to access a particular element it would have to know the length of all previous elements, and the only way to find that out is to look at each one. If you're reading the entire file at once, then you'd be doing this anyway, so it wouldn't be a problem. But if you only want one thing, then this isn't the way to go.
Whereas with an array of pointers, it's O(1) time all around: all you need is an index number, and you can retrieve and follow the pointer to get at your data.
When writing a file using this method, you would of course have to build up your table in memory before doing any writing.
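A sketch of that write path (my own; offsets are absolute file positions and the 4-byte field widths are arbitrary):

import struct

def write_with_pointer_table(f, records):
    # build the offset table in memory first, then write table + records
    header_size = 4 + 4 * len(records)   # record count, then one offset each
    offsets, pos = [], header_size
    for payload in records:
        offsets.append(pos)
        pos += len(payload)
    f.write(struct.pack(">I", len(records)))
    f.write(struct.pack(">%dI" % len(records), *offsets))
    for payload in records:
        f.write(payload)

A reader can then seek straight to offsets[i] for record i without touching any of the records before it.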