Resources on generating equivalent phrases (same language translation)? - language-agnostic

I'm interested in building a program that takes some text (an article, for example) and then generates a new text with equivalent meaning, but I'm not sure how to get started on such a problem.
Can anyone recommend some code/books/papers/techniques that would help me tackle this?

Did you check this one:
fastsubs: a program to generate most likely substitutes for words in a given text based on an n-gram language model.
More info is available at:
http://denizyuret.blogspot.be/2012/05/fastsubs-efficient-admissible-algorithm.html

Related

Extracting 'useful' information out of sentences?

I am currently trying to understand sentences of this form:
The problem was more with the set-top box than the television. Restarting the set-top box solved the problem.
I am totally new to Natural Language Processing and started using Python's NLTK package to get my hands dirty. However, I am wondering if someone could give me an overview of the high-level steps involved in achieving this.
What I am trying to do is to identify what the problem was so in this case, set-top box and whether the action that was taken resolved the problem so in this case, yes because restarting fixed the problem. So if all the sentences were of this form, my life would have been easier but because it is natural language, the sentences could also be of the following form:
I took a look at the car and found nothing wrong with it. However, I suspect there is something wrong with the engine
So in this case, the problem was with the car. The action taken did not resolve the problem because of the presence of the word suspect. And the potential problem could be with the engine.
I am not looking for an absolute answer as I suspect this is very complex. What I am looking for is more rather a high-level overview that will point me in the right direction. If there is an easier/alternate way to do this, that is welcome as well.
Really the best you could hope for is a Naive Bayesian Classifier with a sufficiently large (probably more than you have) training set and be willing to tolerate a fair rate of false determinations.
Seeking the holy grail of NLP is bound to leave you somewhat unsatisfied.
Probably, if the sentences are well-formed, I would experiment with dependency parsing (http://nltk.googlecode.com/svn/trunk/doc/api/nltk.parse.malt.MaltParser-class.html#raw_parse). That gives you a graph of the constituents of a sentence and you can tell the relations between the lexical items. Later, you can extract phrases from the output of a dependency parser (http://nltk.googlecode.com/svn/trunk/doc/book/ch08.html#code-cfg2) That could help you to extract the direct object of a sentence, or the verb phrase in a sentence.
If you just want to get phrases or "chunks" from a sentence, you can try chunk parser (http://nltk.googlecode.com/svn/trunk/doc/api/nltk.chunk-module.html). You can also carry out named entity recognition (http://streamhacker.com/2009/02/23/chunk-extraction-with-nltk/). It's usually used to extract instances of places, organizations or people names but it could work in your case as well.
Assuming that you solve the problem of extracting noun/verb phrases from a sentence, you may need to filter them out to ease the job of your domain expert (too many phrases could overwhelm a judge). You may carry out a frequency analysis on your phrases, remove very frequent ones that are not usually related to the problem domain, or compile a white-list and keep the phrases that contain a pre-defined set of words, etc.

GEDCOM to HTML and RDF

I was wondering if anyone knew of an application that would take a GEDCOM genealogy file and convert it to HTML format for viewing and publishing on the web. I'd like to have separate html files for each individual and perhaps additional files for other content as well. I know there are some tools out there but I was wondering if anyone used any tools and could advise on this. I'm not sure what format to look for such applications. They could be Python or php files that one can edit, or even JavaScript (maybe) or just executable files.
The next issue might be appropriate for a topic in itself. Export of GEDCOM to RDF. My interest here would be to align the information with specific vocabularies, such as BIO or REL which both are extended from FOAF.
Thanks,
Bruce
Like Rob Kam said, Ged2Html was the most popular such program for a long time.
GRAMPS can also create static HTML sites and has the advantage of being free software and having a native XML format which you could easily modify to fit your needs.
Several years ago, I created a simple Java program to turn gedcom into xml. I then used xslt to generate html and rdf. The html I generate is pretty rudimentary, so it would probably be better to look elsewhere for that, but the rdf might be useful to you:
http://jay.askren.net/Projects/SemWeb/
There are a number of these. All listed at http://www.cyndislist.com/gedcom/gedcom-to-web-page-conversion/
Ged2html used to be the most popular and most versatile, but is now no longer being developed. It's an executable, with output customisable through its own scripting syntax.
Family Historian http://www.family-historian.co.uk will create exactly what you are looking for, eg one file per person using the built in Web Site creator. As will a couple of the other Major genealogy packages. I have not seen anything for the RDF part of your question.
I have since tried to produce a Genealogy application using Semantic MediaWiki - MediaWiki, the software behind Wikipedia, and Semantic MediaWiki includes various extensions related to the Semantic Web. I thought it is very easy to use with the forms and the ability to upload a GEDCOM but some feedback from people into genealogy said that it appeared too technical and didn't seem to offer anything new.
So, now the issue is whether to stay with MediaWiki and make it more user friendly or create an entirely new application that allows for adding and updating data in a triple store as well as displaying. I'm not sure how to generate a family tree graphical view of the data, like on sites like ancestry.com, where one can click on a box to see details about the person and update that info or one could click on a right or left arrow around a box to navigate the tree. The data comes from SPARQL queries sent to the data set/triple store both when displaying the initial view and when navigating the tree, where an Ajax call is needed to get more data.
Bruce

PHP/Python/C/C++ library/application to match/correct/give suggestions to input

I'd like to have a simple & lightweight library/application in PHP/Python/C/C++ library/application to match/correct/give suggestions to input. Example in/out:
Input: Webdevelopment ==> Output: Web Development
Input: Web developmen ==> Output: Web Development
Input: Web develop ==> Output: Web Development
Given there is database of correct words and phrases, I just need the library to match/guess phrases. Please suggest if you know any.
How to Write a Spelling Corrector from Google's Director of Resarch Peter Norvik contains a spelling corrector in 21 lines of Python, complete with explanations.
You will have to convert this into a module yourself, but that should be easy. Of course, you will also need a corpus (i.e. words), but he gives sources for these as well.
I guess what you want to do is compute the edit distance between strings (an input, output pair).
One of the simpler ones (that I've used for figuring out a team's full name from it's 3 letter short one - it's a long story..) is the Levenshtein distance. The last external link on the page has a bunch of different implementations of it (turns out it's standard on PHP 4.0.1+).

Improving the way we write code?

While thinking about software-engineering in general I came across the question why we don't see any improvements in the way we write/document code.
Think about it: There has not been a revolutionary improvement since we've moved from punch cards to text editing. The last improvement I've seen is syntax highlighting and context sensitive help (e.g. Intellisense or ctags). Not something I would call revolutionary.
That makes me wonder: Why is it so?
I'll start with something I miss badly:
Lots of my code deals with geometry.
For documentation describing geometric relationships always ends up in a big heap of hard to read mathematical stuff (due to the lack of proper equation type-setting in ASCII). However, if I could embed a little drawing or scribble into the code everything would be much easier, neater and better to be understood.
What can you think up that would make your coding/text editing/documention tasks easier?
I'm surprised that nobody has yet mentioned No Silver Bullet. In 1986 (!), Frederick Brooks predicted that:
There is no single development, in either technology or management technique, which by itself promises even one order-of-magnitude [tenfold] improvement within a decade in productivity, in reliability, in simplicity. [...] We cannot expect ever to see two-fold gains every two years."
And in 23 years, he's been proven right. We've come up with a number of things such as syntax highlighting and Intellisense which have improved productivity significantly, but certainly not by an order of magnitude. As time marches on, we'll continue to make several incremental improvements, but the fact is there is no silver bullet: there's not going to be some magical revelation in the way we write code that will improve productivity by an order of magnitude.
I'm suprised that no one seems to have mentioned Donald Knuth's seminal Literate Programming - write your code as if it were a book or a scientific paper.
There has not been a revolutionary improvement since we've moved from punch cards to text editing
Never used a line editor, have you?
But seriously, text (especially in the representations chosen for modern languages) is
easily processed
fairly easy to specify
information dense
precise
Anything that comes along to replace it has to be a net win across all four of those properties. Not easy.
I disagree. We do have changes, small, but changes.
How common is the "for each" construct? Compare it to 20 years ago. How about the Domain Specific Languages movement? What about the idea that we should code in layers? How about Behavior Driven Development? Coding by complying to a specification... which writes a nice document as output when all runs fine. How about the standarization of regular expressions? PCRE. What about Alan Kay's group's DSL related work on "Moore's Law for Software", which explored a more advanced implementation of Cairo and generated TCP/IP code using diagrams from RFCs?
Documentation is a two way dialog. Both as code being more understandable and people learning this special language. You wouldn't say that German needs documentation if you know German. I know natural languages are very far away from computer languages, but there's a movement to make code more expressive. It's not about the new tools, it's about how we are coding.
One thing I've done recently in some of the more math-heavy sections of my application is to include the LaTeX markup for the particular equation as a comment/docstring. Right now, I just copy-paste into an online equation editor, but it would be very helpful to see the formula itself (with things like Greek letters and sub/superscripts) rather than a bunch of ASCII code.
Source Code In Database. In a nutshell, source code is parsed and put into a database. You'd then need an integrated IDE to view and edit the code, but at this point, syntax is decoupled from format. YOUR IDE could show you a program in a way that's completely different from someone else's, tuned to the task you're working on. I'd list some specific examples, but that article covers pretty much everything.
I'm surprised nobody mentioned it - javadoc is basically HTML, so there's nothing preventing you from embedding images (or anything else) in code. Simple, effective and ubiquitous, it's one of the things Java did right.
DrScheme let's you do these things. Here's the things you can insert from the PLT website:
http://docs.plt-scheme.org/drscheme/Menus.html#(part._.Insert)
3.1.6 Insert
Insert Comment Box : Inserts a box that is ignored by DrScheme; use it to write comments for people who read your program.
Insert Image... : Opens a find-file dialog for selecting an image file in GIF, BMP, XBM, XPM, PNG, or JPG format. The image is treated as a value.
Insert Fraction... : Opens a dialog for a mixed-notation fraction, and inserts the given fraction into the current editor.
Insert Large Letters... : Opens a dialog for a line of text, and inserts a large version of the text (using semicolons and spaces).
Insert λ : Inserts the symbol λ (as a Unicode character) into the program. The λ symbol is normally bound the same as lambda.
Insert Java Comment Box : Inserts a box that is ignored by DrScheme. Unlike the Insert Comment Box menu item, this is designed for the ProfessorJ language levels. See ProfessorJ.
Insert Java Interactions Box : Inserts a box that will allows Java expressions and statements within Scheme programs. The result of the box is a Scheme value corresponding to the result(s) of the Java expressions. At this time, Scheme values cannot enter the box. The box will accept one Java statement or expression per line.
Insert XML Box : Inserts an XML; see XML Boxes and Scheme Boxes for more information.
Insert Scheme Box : Inserts a box to contain Scheme code, typically used inside an XML box; see XML Boxes and Scheme Boxes.
Insert Scheme Splice Box : Inserts a box to contain Scheme code, typically used inside an XML box; see also XML Boxes and Scheme Boxes.
Insert Pict Box : Creates a box for generating a Slideshow picture. Inside the pict box, insert and arrange Scheme boxes that produce picture values.
You also insert your unit tests with the code that you're testing. Pretty neat stuff.
I think integrated IDEs with semantic highlighting and **semantically-constrained suggestions* (a la IDEA or Eclipse) are a huge advancement.
But that happened 8-10 years ago.
Template-based programming feels useful never seems to catch on. Recently I was impressed with a demo of the Meta-programming system, which leverages the interactive nature of the IDE to simplify the task of writing templates and what are (essentially) type-aware macros.
Meta-programming might help you define geometry-based macros that would substitute for a number of lines of code. I could imagine something that let you embed a more-readable 'math language' inside Java, and then parses its contents into something machine-readable.
I'd say version control was a pretty huge leap in how we work. The ability to keep a full record of every change anyone has made to the codebase, and to revert changes where necessary, has made a big difference.
I certainly respect Fred Brooks' argument No Silver Bullet, but I think the way we write code is nowhere near optimal, so there is lots more room for improvement. I tried to explain this in my book.
We're all familiar with "code golf", where you compete relentlessly to minimize something. That is a good way to approach the minimum possible value of that something.
What's great about this is that you are allowed, even encouraged, to break from traditions, prior conceptions, accepted wisdom, in the quest for winning. In short, you learn new things.
If the measure to be minimized is wall-clock execution time, you can do aggressive optimization.
If the measure is source code size (lines or characters) you get "code golf".
The measure I like best is "edit count". That is, given a code base, suppose a new requirement comes along. That requirement is implemented, completely, by editing the code base. Then a "diff" is done from old to new code base. The number of differences found is the edit count. Averaged over the set of likely new functional requirements, that is the measure to minimize.
If this is done aggressively, being free to contradict all conventional wisdom, the code base approaches a state I would call a domain-specific language (DSL). In this language, concepts expressed in code are in nearly 1-1 correspondence with problem-oriented concepts. In this state, it is not easy for the source code to be self-inconsistent (i.e. have bugs) because the fewer edits that have to be made to the source code, the fewer chances there are to make a mistake. It's also the case that such code tends to be short. But unlike "code golf" it tends to be very clear, because it maps the problem concepts so clearly.
So, tools and techniques that help in minimizing edit count can, in my opinion, be considered "silver bullets". DSL is one such. Code generation is another. My favorite optimization technique is another. For coding dynamically changing UIs there is differential execution. There are bound to be more, waiting to be discovered. Of course, everything depends on the training and experience of the "marksman" (the coder).
I think there are lots of new ideas to be discovered. The trick is to tell the difference between the ones that move us forward, versus the ones that hold us back.
I think this is where Doxygen and other documentation systems help. If we can embed small, discrete comments that link to other information such as:
/* help: fooimg.png */
And then have an external documentation system do that, then great.
Even better would be allowing our text-editor to treat those things as hyperlinks to external documentation.
I would reference a drawing as a reference in the code documentation. I see no reason why you can't have footnotes in code.
The ability to make a section of code read-only is something I've wanted
It sounds like you might be interested in Jonathan Edward's research. See, for example:
"The Summer of Code"
"What's next?"
"The future of programming"
Diffing and searching pictures is hard. Diff and search are very important to programmers. Using pictures instead of text is only a marginal improvement in many situations, it has some drawbacks, and it requires general acceptance before it's really worth doing (since you don't make things more understandable if your reader doesn't grok what you've done).
Plus, programmers have a million little tricks that make their lives easier, based on text representations of code, that they'd lose if you gave them code to read that was expressed in anything other than text. Sure, they might replace or re-implement those tricks over time, but in the short term they're gone.
You don't see lawyers switching from English to little back-of-a-napkin diagrams in contracts, either (the Creative Commons licenses try, but cannot make the picture be the formal representation of the contract). Probably for similar reasons.
If someone comes up with a programming language and IDE that, on balance, beats text-based ones; and successfully markets it; then you'll see the start of a revolutionary shift from text to a new format. If nobody comes up with any such thing, then we're not missing out. If someone comes up with something that is more productive but it doesn't gain traction because of independent advantages of other technologies, then that loss is the price we pay for free-market capitalism. Perhaps the ideas will be recycled eventually...
That said, integration between code and documentation could clearly be improved, and there are many efforts underway to do so, using various techniques with varying success. Again, the problem is that any particular cunning plan can in practice only really be implemented in one or a few languages and development environments at a time, and so has difficulty proving that it really is better. Embedding documentation in code is possibly the only universal advance since the invention of the API...
I think there's still a lot that can be done with text, though. For example, debugger technology makes a big difference to programmer productivity in certain common circumstances (namely: when a test fails or something else unexpected happens, but it's not obvious what the faulty assumption is in the code you're looking at). There may be lower-hanging fruit in terms of making programming better, than the actual business of expressing the program.
The last improvement I've seen is
syntax highlighting and context
sensitive help
Then you haven't looked much. Modern IDEs can do far, FAR more than that, namely show you the semantic structure of code (e.g. inheritance hierarchies) and even manipulate it (automatic refactoring) or enrich it with external data (such as who last changed a particular line of code).
I've used emacs, I like text macros. But, what I really want is parse macros. I'd like my editor to expose the machinery behind refactoring in such a way that I can write my transformations on the parse tree of the language itself.
For example, Python added += at one point when my code was littered with x = x + 1 lines. If I could have written a search and replace command that worked on the parse tree, I could have quickly cleaned up large amounts of my source code.
So, I want standard search and replace, but I want it at the level where the structure of my code has meaning, at the abstract syntax tree.
If you've ever used ReSharper, each of its refactorings and recommendations are written in the manner I describe, they find a pattern in the parse tree and suggest a replacement, or for a refactoring, apply a known replacement. I want access to that machinery for my own tasks!
Have you used Doxygen or similar for documenting your code? You can add links to images, and other file types (often stored in same directory as source code) that will get sucked into the generated documentation. I realize that this is one step removed from seeing the detail directly in you favorite editor but it definitely improves how we document our code.
Programming languages are a specialized form of mathematical notation, since you can express a programming language mathematically. Notation changes slowly, and so we don't get fast progress in our languages. Mostly, we advance when we come up with a new thing to fit into the notation, like using i to refer to the square root of negative one.
There are documentation schemes that allow you to embed things other than text. There was at least one programming scheme, Donald Knuth's Web, that allowed you to have a presentation and an execution version of a program (unfortunately, the base source code, the stuff you'd actually hack, was rather messy).
You could easily have a text editor that could treat comments as HTML, provided of course it could recognize comments as it saw them.
I've been thinking a lot about how to make coding faster and more efficient for the past years, always trying to keep it realistic and doing minimalistic implementations. Those are not revolutionary ideas, but since the original poster talked about the punching card to code typing transition, I thought of talking about other ways of communicating to the computer what we want to program.
My ideas are visual or vocal programming. The motivation behind is that there are only a number of ways a loop can be efficiently programmed, and an aware IDE could make some smart code substitution decisions depending on inputs other than typed lines of code.
Visual programming vs Coding: encapsulate (literally) code into "boxes" which have inputs and outputs, and connect them together across a horizontal timeline. This is a high-level concept that would be intrinsically interesting for multithreading development since you can have multiple lines or threads happening at the same time. Every process can be divided into a "box", no matter how you see it. Sending an e-mail in its most basic form is a box which takes an email as input and outputs a success/fail signal. Since the boxes and the lines are distributed across a timeline, the notion of time and event chronology isn't lost and feedback lines are possible.
Vocal programming vs Coding: The effectiveness of this technique would revolve around the effectiveness of the vocal syntax decided to create code and move the cursor. For example, you can say to the microphone "for variable zero to 10" and the system will automatically generate the following code placing the cursor inside:
for (x=0;x<10;x++){
// Cursor would be there after after the call
}
In terms of usability, you would need to be in a relatively silent room to minimize other sounds that might harm the voice recognition so this technology could be used in specialized environments mostly.
This is the result of my extensive programming experience using a wide range of hardware and programming languages. Let me know what you guys think, I'd love having a constructive discussion about that.
A few weeks back the "Intentional Software" created quite a buzz about their new language. I've yet to watch the presentation, but here is a quote from a review by Martin Fowler:
They started worryingly, with the
usual unrevealing Powerpoints, but
then they switched to showing the
workbench and the curtain finally
opened. To gauge the reaction, take a
look at Twitter.
#pandemonial Quite impressed! This is sweet! Multiple domains, multiple
langs, no question is going unanswered
#csells OK, watching a live electrical circuit rendered and
working in a C# file is pretty damn
cool.
#jolson Two words to say about the Electronics demo for Intentional
Software: HOLY CRAPOLA. That's it, my
brain has finally exploded.
#gblock This is not about snazzy demos, this is about completely
changing the world we know it.
#twleung ok, the intellisense for the actuarial formulas is just awesome
#lobrien This is like seeing a 100-mpg carburetor : OMG someone is
going to buy this and put it in a
vault!
Two quotes come instantly to mind:
"If it ain't broke, don't fix it."
"Use the best tool for the job."
Of course, although the core code is still written as text, alll the tools and libraries have changed massively since the days of punched cards.
This has been touched on by others, and it wouldn't revolutionise programming, but anyway...
I think it would be nice if code editors moved slightly beyond plain text editors. Even with syntax highlighting and code completion (which I think are incredibly good things), the editors of today (at least, the ones I use) still display exactly the same ASCII text (or whatever encoding is used) that is in the source files. I would be interested to see how well it would work if editors displayed, for example (some examples are more adventurous than others):
Comments in a text box with a light-blue background and no // or /* ... */ visible
Javadoc comments could have semi-rich text editing support (for those who do HTML Javadoc comments) (seriously, I would appreciate it if code editors rendered Javadoc comments as HTML because they're not the easiest to skim over when their HTML as plain text)
Functions in text boxes that could be collapsed to show only the signature (the collapsing can be done by current editors) and can be dragged around as boxes
Lines between function boxes to indicate how functions are connected
Zooming out so that rather than seeing a single source file (class in many languages) you can see multiple files and the way connect to each other (this would essentially be building UML-like diagramming directly into the code editor)
I think this (in my mind at least) would work without requiring additional markup in the source files so users of plain text editors wouldn't be disadvantaged by having all this extra markup cluttering the files.
Part of the problem might stem from the fact that when you don't code we don't call it programming: Assembling modular components using a GUI for instance.
You might be interested in these alternative programming "languages".
[Ladder][1], designed to mimic the way relay-logic-schemes work. Horrible IMO, but easy to understand for the old guys who did logic with sticks and stones. [http://www.amci.com/tutorials/images/ladder-diagram.gif][2]
[SFC, Sequential function chart][3], designed to simplify parallell programming. Code is written into boxes and these boxes can be placed paralell to each other and will thus execute simultaneously. By connecting the end of several boxes you can syncronize events. Very common for automation applications.
[Mathematica][5]!!!, Might not be the best programming language but the syntax highlighting(if you can call it that) is awesome! For example you can input a matrix by seeing the matrix nicly aligned instead of a huge double[][]. Graphs can be inserted in the code and the formatting of mathematical expressions looks like it does when you write on a paper. No more paranthesis-madness or long Math.PI expressions that really only need one character. And best of all, the files are just plain text even if it is rendered nicely in the editor!
Debuggers is also an area where lots of improvement has been done. Debuggers with replay are starting to come and also visual debuggers where data can be modified in real time. Edit and continiue is also a feature i wouldn't want to live withot.
WTF "new users can only post a maximum of one hyperlink", you will have to google the stuff i originally added to this post >:(
A brain-to-computer translator. Typing is the real
bottleneck. It really just needs to derive the algorithms I
think up and convert that to machine code.
I would say a lot of the newer languages are pretty great at
quickly creating algorithms. The improvements aren't so much
revolutionary now as they are evolutionary.
Dare I say it might actually be a new development language (perhaps even a new paradigm) to take us through such revolution;
I think you might want to take a look at Leo. This is one guys attempt at answering what you're asking about. I still can't wrap my VIM head around it personally, but others take to it quickly. It's not just a programming IDE, but more of an information organizer. It's written in Python, but I don't see why you can't code in other languages with it. The power of Leo is not so much the language, but the ability to express your thoughts and organize them whether it be in code, diagrams, images, or diagrams. Look over the tutorial and examples to get a feel for it. You might like it.
Automated semantic source code transformations, where a program can be reliably examined and manipulated by using an abstract interface/frontend to it that is aware of the underlying semantics.
So that source code can be queried and dealt with pretty much like a SQL database.
Allowing you to do static analysis of source code and refactor even complex source code by doing something along the lines of:
FIND CALLERS OF FUNCTION "foo" WHERE SIGNATURE("int","int","char*") AND RETURN_TYPE("bool");
...
RENAME MACRO "max" TO "maximum" IN FILE "macros.hxx";
RENAME NAMESPACE "prj" TO "project";
RENAME SYMBOL "OLDFOO" IN NAMESPACE "project";
RENAME FUNCTION "log" TO "show_log";
RENAME CLASS "FOO" TO "OLDFOO";
RENAME METHOD "FOO::inc" TO "FOO::increment";
...
CHANGE SIGNATURE IN FUNCTION "foo" WHERE SIGNATURE("int","int") TO SIGNATURE("double","double");
CHANGE SIGNATURE IN METHOD "myClass::handle" WHERE SIGNATURE("char") TO SIGNATURE("unsigned char")
MOVE FUNCTION "foo" in FILE "stuff.cc" TO "foo_funcs.cc";

Tools to help reverse engineer binary file formats

What tools are available to aid in decoding unknown binary data formats?
I know Hex Workshop and 010 Editor both support structures. These are okay to a limited extent for a known fixed format but get difficult to use with anything more complicated, especially for unknown formats. I guess I'm looking at a module for a scripting language or a scriptable GUI tool.
For example, I'd like to be able to find a structure within a block of data from limited known information, perhaps a magic number. Once I've found a structure, then follow known length and offset words to find other structures. Then repeat this recursively and iteratively where it makes sense.
In my dreams, perhaps even automatically identify possible offsets and lengths based on what I've already told the system!
Here are some tips that come to mind:
From my experience, interactive scripting languages (I use Python) can be a great help. You can write a simple framework to deal with binary streams and some simple algorithms. Then you can write scripts that will take your binary and check various things. For example:
Do some statistical analysis on various parts. Random data, for example, will tell you that this part is probably compressed/encrypted. Zeros may mean padding between parts. Scattered zeros may mean integer values or Unicode strings and so on. Try to spot various offsets. Try to convert parts of the binary into 2 or 4 byte integers or into floats, print them and see if they make sence. Write some functions that will search for repeating or very similar parts in the data, this way you can easily spot headers.
Try to find as many strings as possible, try different encodings (c strings, pascal strings, utf8/16, etc.). There are some good tools for that (I think that Hex Workshop has such a tool). Strings can tell you a lot.
Good luck!
For Mac OS X, there's a great tool that's even better than my iBored: Synalyze It!
(http://www.synalysis.net/)
Compared to iBored, it is better suited for non-blocked files, while also giving full control over structures, including scriptability (with Lua). And it visualizes structures better, too.
Tupni; to my knowledge not directly available out of Microsoft Research, but there is a paper about this tool which can be of interest to someone wanting to write a similar program (perhaps open source):
Tupni: Automatic Reverse Engineering of Input Formats (# ACM digital library)
Abstract
Recent work has established the importance of automatic reverse
engineering of protocol or file format specifications. However, the
formats reverse engineered by previous tools have missed important
information that is critical for security applications. In this
paper, we present Tupni, a tool that can reverse engineer an input
format with a rich set of information, including record sequences,
record types, and input constraints. Tupni can generalize the format
specification over multiple inputs. We have implemented a
prototype of Tupni and evaluated it on 10 different formats: five
file formats (WMF, BMP, JPG, PNG and TIF) and five network
protocols (DNS, RPC, TFTP, HTTP and FTP). Tupni identified all
record sequences in the test inputs. We also show that, by aggregating
over multiple WMF files, Tupni can derive a more complete
format specification for WMF. Furthermore, we demonstrate the
utility of Tupni by using the rich information it provides for zeroday
vulnerability signature generation, which was not possible with
previous reverse engineering tools.
My own tool "iBored", which I released just recently, can do parts of this. I wrote the tool to visualize and debug file system formats (UDF, HFS, ISO9660, FAT etc.), and implemented search, copy and later even structure and templates support. The structure support is pretty straight-forward, and the templates are a way to identify structures dynamically.
The entire thing is programmable in a Visual BASIC dialect, allowing you to test values, read specific blocks, and all.
The tool is free, works on all platforms (Win, Mac, Linux), but as it's personal tool which I just released to the public to share it, it's not much documented.
However, if you want to give it a try, and like to give feedback, I might add more useful features.
I'd even open source it, but as it's written in REALbasic, I doubt many people will join such a project.
Link: iBored home page
I still occasionally use an old hex editor called A.X.E., Advanced Hex Editor. It seems to have largely disappeared from the Internet now, though Google should still be able to find it for you. The last version I know of was version 3.4, but I've really only used the free-for-personal-use version 2.1.
Its most interesting feature, and the one I've had the most use for deciphering various game and graphics formats, is its graphical view mode. That basically just shows you the file with each byte turned into a color-coded pixel. And as simple as that sounds, it has made my reverse-engineering attempts a lot easier at times.
I suppose doing it by eye is quite the opposite of doing automatic analysis, though, and the graphical mode won't be much use for finding and following offsets...
The later version has some features that sound like they could fit your needs (scripts, regularity finder, grammar generator), but I have no idea how good they are.
There is Hachoir which is a Python library for parsing any binary format into fields, and then browse the fields. It has lots of parsers for common formats, but you can also write own parsers for your files (eg. when working with code that reads or writes binary files, I usually write a Hachoir parser first to have a debugging aid). Looks like the project is pretty much inactive by now, though.
Kaitai is an open-source language for describing binary structures in data streams. It comes with a translator that can output parsing code for many programming languages, for inclusion in your own program code.
My project icebuddha.com supports this using python to describe the format in the browser.
A cut'n'paste of my answer to a similar question:
One tool is WinOLS, which is designed for interpreting and editing vehicle engine managment computer binary images (mostly the numeric data in their lookup tables). It has support for various endian formats (though not PDP, I think) and viewing data at various widths and offsets, defining array areas (maps) and visualising them in 2D or 3D with all kinds of scaling and offset options. It also has a heuristic/statistical automatic map finder, which might work for you.
It's a commercial tool, but the free demo will let you do everything but save changes to the binary and use engine management features you don't need.