Internationalization in your projects - language-agnostic

How have you implemented Internationalization (i18n) in actual projects you've worked on?
I took an interest in making software cross-cultural after I read the famous post by Joel, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). However, I have yet to be able to take advantage of this in a real project, besides making sure I used Unicode strings where possible. But making all your strings Unicode and understanding the encoding of everything you work with is just the tip of the i18n iceberg.
Everything I have worked on to date has been for use by a controlled set of US-English-speaking people, or i18n just wasn't something we had time to work on before pushing the project live. So I am looking for any tips or war stories people have about making software more localized in real-world projects.

It has been a while, so this is not comprehensive.
Character Sets
Unicode is great, but you can't get away with ignoring other character sets. The default character set on Windows XP (English) is Cp1252. On the web, you don't know what a browser will send you (though hopefully your container will handle most of this). And don't be surprised when there are bugs in whatever implementation you are using. Character sets can have interesting interactions with filenames when files move between machines.
Translating Strings
Translators are, generally speaking, not coders. If you send a source file to a translator, they will break it. Strings should be extracted to resource files (e.g. properties files in Java or resource DLLs in Visual C++). Translators should be given files that are difficult to break and tools that don't let them break them.
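For illustration, a minimal Java sketch of what externalized strings look like (the file name and key are made up for the example):

# messages_de.properties - one file per language, safe to hand to translators
greeting=Hallo, Welt!

import java.util.Locale;
import java.util.ResourceBundle;

// The code looks strings up by key; the locale selects which file is loaded.
ResourceBundle bundle = ResourceBundle.getBundle("messages", Locale.GERMANY);
String greeting = bundle.getString("greeting"); // "Hallo, Welt!"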
Translators do not know where strings come from in a product. It is difficult to translate a string without context. If you do not provide guidance, the quality of the translation will suffer.
While on the subject of context, you may see the same string "foo" crop up multiple times and think it would be more efficient to have all instances in the UI point to the same resource. This is a bad idea. Words may be very context-sensitive in some languages.
Translating strings costs money. If you release a new version of a product, it makes sense to recover the translations from the old version. Have tools to recover strings from your old resource files.
String concatenation and manual manipulation of strings should be minimized. Use the format functions where applicable.
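For example, with Java's MessageFormat the word order lives in the translated pattern rather than in the code; continuing the sketch above (the key and patterns are made up):

import java.text.MessageFormat;

// English resource: file.deleted={0} was deleted on {1}.
// German resource:  file.deleted=Am {1} wurde {0} gelöscht.  (arguments reordered)
String pattern = bundle.getString("file.deleted");
String message = MessageFormat.format(pattern, fileName, date);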
Translators need to be able to modify hotkeys. Ctrl+P is print in English; the Germans use Ctrl+D.
If you have a translation process that requires someone to manually cut and paste strings at any time, you are asking for trouble.
Dates, Times, Calendars, Currency, Number Formats, Time Zones
These can all vary from country to country. A comma may be used to denote decimal places. Times may be in 24-hour notation. Not everyone uses the Gregorian calendar. You need to be unambiguous, too. If you take care to display dates as MM/DD/YYYY for the USA and DD/MM/YYYY for the UK on your website, the dates are still ambiguous unless the user knows you've done it.
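In Java, for instance, the locale-aware formatters in java.text handle most of this; a small sketch (the exact output varies by JDK version):

import java.text.DateFormat;
import java.text.NumberFormat;
import java.util.Date;
import java.util.Locale;

NumberFormat.getNumberInstance(Locale.GERMANY).format(10000.5); // "10.000,5" - comma decimal
NumberFormat.getNumberInstance(Locale.US).format(10000.5);      // "10,000.5"
DateFormat.getDateInstance(DateFormat.SHORT, Locale.UK).format(new Date()); // day first, e.g. 07/02/08
DateFormat.getDateInstance(DateFormat.SHORT, Locale.US).format(new Date()); // month first, e.g. 2/7/08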
Especially Currency
The Locale functions provided in the class libraries will give you the local currency symbol, but you can't just stick a pound (sterling) or euro symbol in front of a value that is actually a price in dollars.
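In Java, one way to keep the two concerns separate is to set the currency on the formatter explicitly; a sketch (the output shown is approximate):

import java.text.NumberFormat;
import java.util.Currency;
import java.util.Locale;

// A price in US dollars shown to a German user: German number formatting,
// but the dollar symbol - never the user's local currency symbol.
NumberFormat nf = NumberFormat.getCurrencyInstance(Locale.GERMANY);
nf.setCurrency(Currency.getInstance("USD"));
String price = nf.format(1234.5); // e.g. "1.234,50 $", not "1.234,50 €"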
User Interfaces
Layout should be dynamic. Not only are strings likely to double in length on translation, the entire UI may need to be inverted (Hebrew; Arabic) so that the controls run from right to left. And that is before we get to Asia.
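In Java Swing, for example, the orientation can be derived from the locale rather than hard-coded; a minimal sketch:

import java.awt.ComponentOrientation;
import java.util.Locale;
import javax.swing.JFrame;

JFrame frame = new JFrame("Example");
// Flips the whole component tree to right-to-left for RTL locales such as Arabic.
frame.applyComponentOrientation(
        ComponentOrientation.getOrientation(Locale.forLanguageTag("ar")));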
Testing Prior To Translation
Use static analysis of your code to locate problems. At a bare minimum, leverage the tools built into your IDE. (Eclipse users can go to Window > Preferences > Java > Compiler > Errors/Warnings and check for non-externalised strings.)
Smoke test by simulating translation. It isn't difficult to parse a resource file and replace strings with a pseudo-translated version that doubles the length and inserts funky characters. You don't have to speak a language to use a foreign operating system. Modern systems should let you log in as a foreign user with translated strings and foreign locale. If you are familiar with your OS, you can figure out what does what without knowing a single word of the language.
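As a sketch, such a pseudo-translation script could look roughly like this in Java (the file names are made up; String.repeat requires Java 11+):

import java.io.*;
import java.util.Properties;

public class PseudoTranslate {
    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        try (InputStream in = new FileInputStream("messages.properties")) {
            props.load(in);
        }
        // Bracket every value, add accented characters, and pad to roughly double
        // length so truncations and hard-coded strings stand out in the UI.
        for (String key : props.stringPropertyNames()) {
            String value = props.getProperty(key);
            props.setProperty(key, "[åéîøü " + value + " " + "!".repeat(value.length()) + "]");
        }
        try (OutputStream out = new FileOutputStream("messages_xx.properties")) {
            props.store(out, "pseudo-translated");
        }
    }
}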
Keyboard maps and character set references are very useful.
Virtualisation would be very useful here.
Non-technical Issues
Sometimes you have to be sensitive to cultural differences (offence or incomprehension may result). A mistake you often see is the use of flags as a visual cue for choosing a website language or geography. Unless you want your software to declare sides in global politics, this is a bad idea. If you were French and offered English under St. George's flag (the flag of England is a red cross on a white field), this might confuse many English speakers too - assume similar issues will arise with other languages and countries. Icons need to be vetted for cultural relevance: what does a thumbs-up or a green tick mean? Language should be relatively neutral - addressing users in a particular manner may be acceptable in one region, but considered rude in another.
Resources
C++ and Java programmers may find the ICU website useful: http://www.icu-project.org/

Some fun things:
Having a PHP and MySQL application that works well with German and French, but now needs to support Russian and Chinese. I think I'll move this over to .NET, as PHP's Unicode support is - in my opinion - not really good. Sure, juggling around with utf8_decode()/utf8_encode() or the mbstring functions is fun. Almost as fun as having Freddy Krüger visit you at night...
Realizing that some languages are a LOT more verbose than others. German is usually a LOT more verbose than English, and seeing how the German version destroys the user interface because too little space was allocated was not fun. Some products gained a certain fame for their creative workarounds, with Oblivion's "Schw.Tr.d.Le.En.W." being memorable :-)
Playing around with date formats, woohoo! Yes, there ARE actually people in the world who use date formats where the day goes in the middle. Sooooo much fun trying to find out what 07/02/2008 is supposed to mean, just because some users might believe it could be July 2... But then again, you guys over the pond may believe the same about users who put the month in the middle :-P, especially because in English, July 2 sounds a lot better than 2nd of July, something that does not necessarily apply to other languages (e.g. in German, you would never say Juli 2 but always Zweiter Juli). I use 2008-02-07 whenever possible. It's clear that it means February 7 and it sorts properly, but dd/mm vs. mm/dd can be a really tricky problem.
Another fun thing: number formats! 10.000,50 vs 10,000.50 vs. 10 000,50 vs. 10'000,50... This is my biggest nightmare right now, having to support a multi-cultural environment but not having any way to reliably know what number format the user will use.
Formal or informal. In some languages, there are two ways to address people, a formal way and a more informal way. In English, you just say "You", but in German you have to decide between the formal "Sie" and the informal "Du"; same for French Tu/Vous. It's usually a safe bet to choose the formal way, but this is easily overlooked.
Calendars. In Europe, the first day of the week is Monday, whereas in the US it's Sunday. Calendar widgets are nice. Showing a calendar with Sunday on the left and Saturday on the right to a European user is not so nice; it confuses them.
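In Java, for instance, you can ask the locale instead of assuming:

import java.util.Calendar;
import java.util.Locale;

// Returns Calendar.MONDAY for Germany and Calendar.SUNDAY for the US.
int firstDE = Calendar.getInstance(Locale.GERMANY).getFirstDayOfWeek();
int firstUS = Calendar.getInstance(Locale.US).getFirstDayOfWeek();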

I worked on a project for my previous employer that used .NET, and there was a built-in .resx format we used. We basically had a default .resx file containing all the strings, and then one file per language with the translations. The consequence of this is that you have to be very diligent about ensuring that every string visible in the application is stored in the .resx, and any time one is changed you have to update all languages you support.
If you get lazy and don't notify the people in charge of translations, or you embed strings without going through your localization system, it will be a nightmare to try and fix it later. Similarly, if localization is an afterthought, it will be very difficult to put in place. Bottom line, if you don't have all visible strings stored externally in a standard place, it will be very difficult to find all that need to be localized.
One other note, very strictly avoid concatenating visible strings directly, such as
String message = "The " + item + " is on sale!";
Instead, you must use something like
String message = String.Format("The {0} is on sale!", item);
The reason for this is that different languages often order words differently; fixing direct concatenation requires a new build, but if you used a string-replacement mechanism like the one above, you can just modify your .resx file (or whatever localization files you use) for the specific language that needs to reorder the words.

I was just listening to a podcast from Scott Hanselman this morning, where he talks about internationalization, especially the really tricky things, like Turkish (with its four i's) and Thai. Also, Jeff Atwood had a post:

Besides all the previous tips, remember that i18n is not just about swapping words for their equivalents in other languages, especially for non-Latin alphabets (Korean, Arabic) and scripts written right to left, where the whole UI has to conform, like
item 1
item 2
item 3
would have to be
arabic text 1 -
arabic text 2 -
arabic text 3 -
(reversed bullet list doesn't seem to work :P)
which can be a UI nightmare if your system has to apply the changes dynamically once the user changes the language being used.
Another very hard thing is to test the different languages, not just for the correctness of the words, but because languages like Korean usually need a bigger font size for their characters, which may lead to language-specific bugs (like the "SAVE" text on a button being larger than the button itself in some language).

One of the funnier things to discover: italic and bold text markup does not work with CJK (Chinese/Japanese/Korean) characters. They simply become unreadable. (OK, I couldn't really read them before either, but bolding especially just creates ink blots)

I think everyone working in internationalization should be familiar with the Common Locale Data Repository, which is now a sub-project of Unicode:
Common Locale Data Repository
Those folks are working hard to establish a standard resource for all kinds of i18n issues: currency, geographical names, tons of stuff. Any project that maintains its own core locale data, given that this project exists, is pretty bonkers, IMHO.

I suggest using something like 99translations.com to maintain your translations. Otherwise you won't be able to tell which of your translations are up to date in every language.

Another challenge will be accepting input from your users. In many cases, this is eased by the input processing provided by the operating system, such as IME in Windows, which works transparently with common text widgets, but this facility will not be available for every possible need.

One website I use has a translation method the owner calls "wiki + machine translation". This is a community-based site, so its needs are obviously different from a company's.
http://blog.bookmooch.com/2007/09/23/how-bookmooch-does-its-translations/

One thing no one has mentioned yet is strings with a varying part, as in "The unit will arrive in 5 days" or "On Monday something happens", where 5 and Monday change depending on state. It is not a good idea to split those in two and concatenate the pieces. With only one varying part and good documentation you might get away with it; with two varying parts there will be some language that prefers to change their order.
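Java's MessageFormat, to pick one example, can keep the whole sentence in a single translatable pattern and handle the plural as well (the pattern here is illustrative):

import java.text.MessageFormat;

// The choice format selects "one day" vs "5 days"; a translator can reorder
// the placeholders without touching any code.
String pattern = "The unit will arrive in {0,choice,1#one day|1<{0,number,integer} days}.";
String message = MessageFormat.format(pattern, 5); // "The unit will arrive in 5 days."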

Related

Is it safe to use numbers in your web page file names?

Someone recently told me that using numbers in web page file names is not good practice. For example, say I was making a website about Samara Morgan and I had a file named 7days.html - would it be bad to start the file name with a number? Is it riskier than having the number later in the file name (i.e. day7.html)?
I'm just a tad confused on whether it's generally discouraged to use numbers in file names or not.
EDIT: After asking them to explain a bit more, this is what they said to me:
.... the simplest way I can explain it is that certain programming languages and operating systems might be confused by putting the number as the first character. In other words, it has a higher potential for error, so it's not recommended. That being said, it IS acceptable to use a number AFTER the first character. By the way, a domain name (like 4chan.org) is a little different because it's not a file.
Here are some more tips/best practices (you'll see it as #3):
https://ed.fnal.gov/lincon/tech_web_naming.shtml
I think you need to go back to this someone and ask them for more information - are they saying there's a security problem? a usability problem because of something users might want to do with it? a Search Engine Optimisation trick you're missing that would make it easier for people to find?
I can't actually think of why numbers in URLs would matter for any of these, however. It seems most likely they were thinking of SEO, because that's a constant battle between search engines (who want users to get the results they want) and publishers (who want to get their brand higher up the results) and full of half-understood experiments and dodgy advice.
It's also worth noting that URLs don't exactly have "filenames" at all - they're just a string that the browser sends to the server, and the server may or may not map to a file on disk. Look at the URL of this page, for instance - it contains enough information for the server to look up the right question in a database, plus some human-readable text which is mostly for SEO.
Your server has filenames, of course, but I can't think of any reason why having numbers in those would be a problem, let alone why it would apply particularly to web pages.
Edit based on additional information supplied:
Two things I notice about the link you've added: one, it's twenty years old; two, it includes detailed reasoning for every single point, except point 3. I can't think of any "programming languages and operating systems" that would have a problem with a leading digit. It's actually quite common in some (non-web) contexts, as a way of forcing files to be listed or run in the desired order (e.g. 01-contents.txt, 02-introduction.txt, etc).
I can imagine problems if you began the filename with a ., -, or _, because sometimes there are entrenched conventions that those are hidden, or backups, etc. Either the advice made sense 20 years ago, or the author was being overly conservative to keep the rule simple.
To be precise: your question asks whether it is permissible or appropriate to begin the name of a file with one or more numeric characters, and according to the naming conventions for files used by the main operating systems, this type of naming is allowed and presents no problem of interpretation.
windows https://msdn.microsoft.com/en-us/library/windows/desktop/aa365247(v=vs.85).aspx
linux https://www.cyberciti.biz/faq/linuxunix-rules-for-naming-file-and-directory-names/
The situation is slightly different for programming languages. The most common case is that of C/C++, where variable names that consist entirely of numeric characters, or that begin with a numeric character, are not allowed at all, so the question does not even arise there.
(See Stack Overflow for C/C++ variable-naming rules and examples.)
Therefore, in your case, which concerns the names of files, the limitations you were told about do not apply.
No, just keep it like that; it doesn't affect anything.

Is internationalizing later really more expensive? [closed]

Most people would agree that internationalizing an existing app is more expensive than developing an internationalized app from scratch.
Is that really true? Or when you write an internationalized app from scratch the cost of doing I18N is just being amortized over multiple small assignments and nobody feels on his shoulders the whole weight of the internationalization task?
You could even claim that a mature app has many, many LOC that were deleted during the project's history, and that they wouldn't need to be i18n'ed if internationalization is done as an afterthought, but would have been if the project was internationalized from the very beginning.
So, do you think a project starting today must be internationalized, or can that decision be deferred to the future based on the success (or not) the software enjoys and the geographic distribution of the demand?
I am not talking about the ability to manipulate unicode data. That you have for free in most mainstream languages, databases and libraries. I am talking specifically of supporting your own software's user interface in multiple languages and locales.
"when you write an internationalized app from scratch the cost of doing I18N is ... amortized"
However, that's not the whole story.
Retroactively tracking down every message to the users is -- in some cases -- impossible.
Not hard. Impossible.
Consider this.
theMessage = "Some initial part" + some_function() + "some following part";
You're going to have a terrible time finding all of these kinds of situations. After all, some_function just returns a String. You don't know if it's a database key (never shown to a person) or a message which must be translated. And when it's translated, grammar rules may reveal that a 3-part string concatenation was a dumb idea.
You can't simply GREP every String-valued function as containing a possible I18N message that must be translated. You have to actually read the code, and possibly rewrite the function.
Clearly, when some_function has any complexity to it at all, you're stumped as to why one part of your application is still in Swedish while the rest was successfully I18N'd into other languages. (Not to pick on Swedes in particular, replace this with any language used for development different from final deployment.)
Worse, of course, if you're working in C or C++, you might have some of this split between pre-processor macros and proper C-language syntax.
And in a dynamic language -- where code can be built on the fly -- you'll be paralyzed by a design in which you can't positively identify all the code. While dynamically generating code is a bad idea, it also makes your retroactive I18N job impossible.
I'm going to have to disagree that it costs more to add it to an existing application than from scratch with a new one.
A lot of the time i18n is not required until the application gets 'big'. When you do get big, you will likely have a bigger development team to devote to i18n so it will be less of a burden.
You may not actually need it. A lot of small teams put great effort into supporting internationalization when they have no customers who require it.
Once you have internationalized, it makes incremental changes more time-consuming. It doesn't take a lot of extra time, but every time you need to add a string to the product, you need to add it to the bundle first and then add a reference. No, it is not a lot of work, but it is effort and does take a bit of time.
I prefer to 'cross that bridge when we come to it' and internationalize only when you have a paying customer looking for it.
Yes, internationalizing an existing app is definitely more expensive than developing the app as internationalized from day one. And it's almost never trivial.
For instance
Message = "Do you want to load the " & fileType() & " file?"
cannot be internationalised without some code alterations because many languages have grammatical rules like gender agreement. You often need a different message string for loading every possible file type, unlike in English when it's possible to bolt together substrings.
There are many other issues like this: you need more UI space because some languages need more characters than English to express the same concept, you need bigger fonts for East Asia, you need to use localised dates/times in the user interface but perhaps US English when communicating with databases, you may need to use a semicolon as the delimiter for CSV files, string comparisons and sorting are cultural, phone numbers & addresses...
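One common workaround for the gender problem above is a separate message key per file type rather than a spliced sentence, so each language can inflect the whole phrase; a sketch with made-up keys:

# English resource file
load.confirm.image=Do you want to load the image file?
load.confirm.report=Do you want to load the report file?

# German resource file - the article changes with the noun's gender,
# so the whole sentence is translated as a unit
load.confirm.image=Möchten Sie die Bilddatei laden?
load.confirm.report=Möchten Sie den Bericht laden?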
So do you think a project starting today, must be internationalized, or can that decision be deferred to the future based on the success (or not) the software enjoys and the geographic distribution of the demand?
It depends. How likely is the specific project to be internationalised? How important is it to get a first version out fast?
If you truly think you get "unicode handling" "for free", you may have a surprise coming your way when you try.
Unless you use a framework that has proven i18n ability beyond languages with the ANSI or very similar character sets, you will find several niggles and more major issues where the unicode handling isn't quite right, or simply unavailable. Even with relatively common languages (e.g. German) you can run into difficulty with shrinking or expanding letter counts and APIs that don't support unicode.
And then think of languages with different reading-ordering!
This is one of the reasons you should really plan it in from the beginning, and test the stuff to destruction on the set of languages you plan to support.
The concept of i18n and l10n is broader than merely translating strings to and fro some languages.
Example: Consider the input of date and time by users. If you don't have internationalization in mind when you design
a) the interface for the user and
b) the storage, retrieval and display mechanism
you will have a really hard time when you want to enable other input schemes.
Agreed, in most cases i18n is not necessary in the first place. But, and that is my point, if you don't spend a thought on some areas that must be touched for i18n, you will find yourself rewriting large portions of the original code. And then adding i18n is a lot more expensive than having spent some thought beforehand.
One thing that can be a big issue is the different character counts for a message in various languages. I do some work on iPhone apps, and especially on a small screen, if you design the UI for a message that is 10 characters long and then internationalize later and find you need 20 characters to display the same thing, you now have to redo your UI to accommodate it. Even with desktop apps this can still be a large PITA.
It depends on your project and how your team is organised.
I've been involved in the internationalization of a website, and it was one developer full-time for a year, probably about 6-8 months part-time for me to handle installation impacts when needed (reorganising files, etc), and other developers getting involved from time to time when their projects needed heavy refactoring. This was in an application that was at v3.
So that's definitely expensive. What you have to ask is how expensive is it to provide a localization system from the start, and how will that impact the project in the early stages. Your project at v1 may not be able to survive delays and setbacks caused by issues with a hastily-designed internationalization framework, while a stable v3 project with a wide customer base may have the capital to invest in doing that properly.
It also depends on whether you want to internationalize everything including log messages, or just the UI strings, and how many of those UI strings there are, and who you have available to do localization and the QA that goes with it, and even what languages you want to support - for example, does your system need to support unicode strings (which is a requirement for Asian languages).
And don't forget that changing the database backend to support internationalized data can be costly as well. Just try to change that varchar field to nvarchar when you already have 20,000,000 records.
I think it depends on the language. Every J2EE (Java web) app is internationalized, because it's very easy (even the IDE can extract the strings for you, and you just name them).
In J2EE it's cheaper to add it later; however, the culture is to add it as soon as possible. I think that's because J2EE uses a lot of open source, and almost all open-source libs are internationalized. It's a great idea for the libraries, but not for most J2EE apps: most enterprise apps are just for one company that speaks one language.
Plus, if you have bad testers, putting it in too soon makes them give you bug reports about labels and translations (only once have I seen translations done NOT by the developers). After the testers are done with it, you have a buggy app with excellent i18n support. It might be fun for users to switch the language and see if they can still use the app, but using your app is just boring work for them, so they won't even do that. The only users of the i18n are the testers.
Weird string joining is not part of J2EE culture, since you know that one day someone might want to make the app international. The only problem is extracting labels from HTML templates.
I can't say what is expensive, but I can tell you that a clean API lets you internationalize your application at very low cost.

Improving the way we write code?

While thinking about software-engineering in general I came across the question why we don't see any improvements in the way we write/document code.
Think about it: There has not been a revolutionary improvement since we've moved from punch cards to text editing. The last improvement I've seen is syntax highlighting and context sensitive help (e.g. Intellisense or ctags). Not something I would call revolutionary.
That makes me wonder: Why is it so?
I'll start with something I miss badly:
Lots of my code deals with geometry.
Documentation describing geometric relationships always ends up as a big heap of hard-to-read mathematical stuff (due to the lack of proper equation typesetting in ASCII). However, if I could embed a little drawing or scribble into the code, everything would be much easier, neater and better understood.
What can you think up that would make your coding/text editing/documentation tasks easier?
I'm surprised that nobody has yet mentioned No Silver Bullet. In 1986 (!), Frederick Brooks predicted that:
There is no single development, in either technology or management technique, which by itself promises even one order-of-magnitude [tenfold] improvement within a decade in productivity, in reliability, in simplicity. [...] We cannot expect ever to see two-fold gains every two years."
And in 23 years, he's been proven right. We've come up with a number of things such as syntax highlighting and Intellisense which have improved productivity significantly, but certainly not by an order of magnitude. As time marches on, we'll continue to make several incremental improvements, but the fact is there is no silver bullet: there's not going to be some magical revelation in the way we write code that will improve productivity by an order of magnitude.
I'm surprised that no one seems to have mentioned Donald Knuth's seminal Literate Programming - write your code as if it were a book or a scientific paper.
There has not been a revolutionary improvement since we've moved from punch cards to text editing
Never used a line editor, have you?
But seriously, text (especially in the representations chosen for modern languages) is
easily processed
fairly easy to specify
information dense
precise
Anything that comes along to replace it has to be a net win across all four of those properties. Not easy.
I disagree. We do have changes, small, but changes.
How common is the "for each" construct? Compare it to 20 years ago. How about the Domain Specific Languages movement? What about the idea that we should code in layers? How about Behavior Driven Development? Coding by complying with a specification... which writes a nice document as output when everything runs fine. How about the standardization of regular expressions? PCRE. What about Alan Kay's group's DSL-related work on "Moore's Law for Software", which explored a more advanced implementation of Cairo and generated TCP/IP code using diagrams from RFCs?
Documentation is a two-way dialog: the code becomes more understandable, and people learn this special language. You wouldn't say that German needs documentation if you know German. I know natural languages are very far from computer languages, but there's a movement to make code more expressive. It's not about the new tools, it's about how we are coding.
One thing I've done recently in some of the more math-heavy sections of my application is to include the LaTeX markup for the particular equation as a comment/docstring. Right now, I just copy-paste into an online equation editor, but it would be very helpful to see the formula itself (with things like Greek letters and sub/superscripts) rather than a bunch of ASCII code.
Source Code In Database. In a nutshell, source code is parsed and put into a database. You'd then need an integrated IDE to view and edit the code, but at this point, syntax is decoupled from format. YOUR IDE could show you a program in a way that's completely different from someone else's, tuned to the task you're working on. I'd list some specific examples, but that article covers pretty much everything.
I'm surprised nobody mentioned it - javadoc is basically HTML, so there's nothing preventing you from embedding images (or anything else) in code. Simple, effective and ubiquitous, it's one of the things Java did right.
DrScheme lets you do these things. Here are the things you can insert, from the PLT website:
http://docs.plt-scheme.org/drscheme/Menus.html#(part._.Insert)
3.1.6 Insert
Insert Comment Box : Inserts a box that is ignored by DrScheme; use it to write comments for people who read your program.
Insert Image... : Opens a find-file dialog for selecting an image file in GIF, BMP, XBM, XPM, PNG, or JPG format. The image is treated as a value.
Insert Fraction... : Opens a dialog for a mixed-notation fraction, and inserts the given fraction into the current editor.
Insert Large Letters... : Opens a dialog for a line of text, and inserts a large version of the text (using semicolons and spaces).
Insert λ : Inserts the symbol λ (as a Unicode character) into the program. The λ symbol is normally bound the same as lambda.
Insert Java Comment Box : Inserts a box that is ignored by DrScheme. Unlike the Insert Comment Box menu item, this is designed for the ProfessorJ language levels. See ProfessorJ.
Insert Java Interactions Box : Inserts a box that allows Java expressions and statements within Scheme programs. The result of the box is a Scheme value corresponding to the result(s) of the Java expressions. At this time, Scheme values cannot enter the box. The box will accept one Java statement or expression per line.
Insert XML Box : Inserts an XML box; see XML Boxes and Scheme Boxes for more information.
Insert Scheme Box : Inserts a box to contain Scheme code, typically used inside an XML box; see XML Boxes and Scheme Boxes.
Insert Scheme Splice Box : Inserts a box to contain Scheme code, typically used inside an XML box; see also XML Boxes and Scheme Boxes.
Insert Pict Box : Creates a box for generating a Slideshow picture. Inside the pict box, insert and arrange Scheme boxes that produce picture values.
You can also insert your unit tests with the code that you're testing. Pretty neat stuff.
I think integrated IDEs with semantic highlighting and semantically-constrained suggestions (a la IDEA or Eclipse) are a huge advancement.
But that happened 8-10 years ago.
Template-based programming feels useful but never seems to catch on. Recently I was impressed with a demo of the Meta Programming System, which leverages the interactive nature of the IDE to simplify the task of writing templates and what are (essentially) type-aware macros.
Meta-programming might help you define geometry-based macros that would substitute for a number of lines of code. I could imagine something that let you embed a more-readable 'math language' inside Java, and then parses its contents into something machine-readable.
I'd say version control was a pretty huge leap in how we work. The ability to keep a full record of every change anyone has made to the codebase, and to revert changes where necessary, has made a big difference.
I certainly respect Fred Brooks' argument No Silver Bullet, but I think the way we write code is nowhere near optimal, so there is lots more room for improvement. I tried to explain this in my book.
We're all familiar with "code golf", where you compete relentlessly to minimize something. That is a good way to approach the minimum possible value of that something.
What's great about this is that you are allowed, even encouraged, to break from traditions, prior conceptions, accepted wisdom, in the quest for winning. In short, you learn new things.
If the measure to be minimized is wall-clock execution time, you can do aggressive optimization.
If the measure is source code size (lines or characters) you get "code golf".
The measure I like best is "edit count". That is, given a code base, suppose a new requirement comes along. That requirement is implemented, completely, by editing the code base. Then a "diff" is done from old to new code base. The number of differences found is the edit count. Averaged over the set of likely new functional requirements, that is the measure to minimize.
If this is done aggressively, being free to contradict all conventional wisdom, the code base approaches a state I would call a domain-specific language (DSL). In this language, concepts expressed in code are in nearly 1-1 correspondence with problem-oriented concepts. In this state, it is not easy for the source code to be self-inconsistent (i.e. have bugs) because the fewer edits that have to be made to the source code, the fewer chances there are to make a mistake. It's also the case that such code tends to be short. But unlike "code golf" it tends to be very clear, because it maps the problem concepts so clearly.
So, tools and techniques that help in minimizing edit count can, in my opinion, be considered "silver bullets". DSL is one such. Code generation is another. My favorite optimization technique is another. For coding dynamically changing UIs there is differential execution. There are bound to be more, waiting to be discovered. Of course, everything depends on the training and experience of the "marksman" (the coder).
I think there are lots of new ideas to be discovered. The trick is to tell the difference between the ones that move us forward, versus the ones that hold us back.
I think this is where Doxygen and other documentation systems help. If we can embed small, discrete comments that link to other information such as:
/* help: fooimg.png */
And then have an external documentation system do that, then great.
Even better would be allowing our text-editor to treat those things as hyperlinks to external documentation.
I would reference a drawing in the code documentation. I see no reason why you can't have footnotes in code.
The ability to make a section of code read-only is something I've wanted.
It sounds like you might be interested in Jonathan Edward's research. See, for example:
"The Summer of Code"
"What's next?"
"The future of programming"
Diffing and searching pictures is hard. Diff and search are very important to programmers. Using pictures instead of text is only a marginal improvement in many situations, it has some drawbacks, and it requires general acceptance before it's really worth doing (since you don't make things more understandable if your reader doesn't grok what you've done).
Plus, programmers have a million little tricks that make their lives easier, based on text representations of code, that they'd lose if you gave them code to read that was expressed in anything other than text. Sure, they might replace or re-implement those tricks over time, but in the short term they're gone.
You don't see lawyers switching from English to little back-of-a-napkin diagrams in contracts, either (the Creative Commons licenses try, but cannot make the picture be the formal representation of the contract). Probably for similar reasons.
If someone comes up with a programming language and IDE that, on balance, beats text-based ones; and successfully markets it; then you'll see the start of a revolutionary shift from text to a new format. If nobody comes up with any such thing, then we're not missing out. If someone comes up with something that is more productive but it doesn't gain traction because of independent advantages of other technologies, then that loss is the price we pay for free-market capitalism. Perhaps the ideas will be recycled eventually...
That said, integration between code and documentation could clearly be improved, and there are many efforts underway to do so, using various techniques with varying success. Again, the problem is that any particular cunning plan can in practice only really be implemented in one or a few languages and development environments at a time, and so has difficulty proving that it really is better. Embedding documentation in code is possibly the only universal advance since the invention of the API...
I think there's still a lot that can be done with text, though. For example, debugger technology makes a big difference to programmer productivity in certain common circumstances (namely: when a test fails or something else unexpected happens, but it's not obvious what the faulty assumption is in the code you're looking at). There may be lower-hanging fruit in terms of making programming better, than the actual business of expressing the program.
The last improvement I've seen is syntax highlighting and context sensitive help
Then you haven't looked much. Modern IDEs can do far, FAR more than that, namely show you the semantic structure of code (e.g. inheritance hierarchies) and even manipulate it (automatic refactoring) or enrich it with external data (such as who last changed a particular line of code).
I've used emacs, I like text macros. But, what I really want is parse macros. I'd like my editor to expose the machinery behind refactoring in such a way that I can write my transformations on the parse tree of the language itself.
For example, Python added += at one point when my code was littered with x = x + 1 lines. If I could have written a search and replace command that worked on the parse tree, I could have quickly cleaned up large amounts of my source code.
So, I want standard search and replace, but I want it at the level where the structure of my code has meaning, at the abstract syntax tree.
If you've ever used ReSharper, each of its refactorings and recommendations are written in the manner I describe, they find a pattern in the parse tree and suggest a replacement, or for a refactoring, apply a known replacement. I want access to that machinery for my own tasks!
Have you used Doxygen or similar for documenting your code? You can add links to images, and other file types (often stored in same directory as source code) that will get sucked into the generated documentation. I realize that this is one step removed from seeing the detail directly in you favorite editor but it definitely improves how we document our code.
Programming languages are a specialized form of mathematical notation, since you can express a programming language mathematically. Notation changes slowly, and so we don't get fast progress in our languages. Mostly, we advance when we come up with a new thing to fit into the notation, like using i to refer to the square root of negative one.
There are documentation schemes that allow you to embed things other than text. There was at least one programming scheme, Donald Knuth's Web, that allowed you to have a presentation and an execution version of a program (unfortunately, the base source code, the stuff you'd actually hack, was rather messy).
You could easily have a text editor that could treat comments as HTML, provided of course it could recognize comments as it saw them.
I've been thinking a lot about how to make coding faster and more efficient for the past years, always trying to keep it realistic and doing minimalistic implementations. These are not revolutionary ideas, but since the original poster talked about the transition from punch cards to typed code, I thought I'd talk about other ways of communicating to the computer what we want to program.
My ideas are visual or vocal programming. The motivation behind is that there are only a number of ways a loop can be efficiently programmed, and an aware IDE could make some smart code substitution decisions depending on inputs other than typed lines of code.
Visual programming vs Coding: encapsulate (literally) code into "boxes" which have inputs and outputs, and connect them together across a horizontal timeline. This is a high-level concept that would be intrinsically interesting for multithreading development since you can have multiple lines or threads happening at the same time. Every process can be divided into a "box", no matter how you see it. Sending an e-mail in its most basic form is a box which takes an email as input and outputs a success/fail signal. Since the boxes and the lines are distributed across a timeline, the notion of time and event chronology isn't lost and feedback lines are possible.
Vocal programming vs Coding: The effectiveness of this technique would revolve around the effectiveness of the vocal syntax designed for creating code and moving the cursor. For example, you could say to the microphone "for variable zero to 10" and the system would automatically generate the following code, placing the cursor inside:
for (int x = 0; x < 10; x++) {
    // Cursor would be placed here after the call
}
In terms of usability, you would need to be in a relatively silent room to minimize other sounds that might harm the voice recognition so this technology could be used in specialized environments mostly.
This is the result of my extensive programming experience using a wide range of hardware and programming languages. Let me know what you guys think, I'd love having a constructive discussion about that.
A few weeks back the "Intentional Software" created quite a buzz about their new language. I've yet to watch the presentation, but here is a quote from a review by Martin Fowler:
They started worryingly, with the usual unrevealing Powerpoints, but then they switched to showing the workbench and the curtain finally opened. To gauge the reaction, take a look at Twitter.
#pandemonial Quite impressed! This is sweet! Multiple domains, multiple langs, no question is going unanswered
#csells OK, watching a live electrical circuit rendered and working in a C# file is pretty damn cool.
#jolson Two words to say about the Electronics demo for Intentional Software: HOLY CRAPOLA. That's it, my brain has finally exploded.
#gblock This is not about snazzy demos, this is about completely changing the world we know it.
#twleung ok, the intellisense for the actuarial formulas is just awesome
#lobrien This is like seeing a 100-mpg carburetor: OMG someone is going to buy this and put it in a vault!
Two quotes come instantly to mind:
"If it ain't broke, don't fix it."
"Use the best tool for the job."
Of course, although the core code is still written as text, all the tools and libraries have changed massively since the days of punched cards.
This has been touched on by others, and it wouldn't revolutionise programming, but anyway...
I think it would be nice if code editors moved slightly beyond plain text editors. Even with syntax highlighting and code completion (which I think are incredibly good things), the editors of today (at least, the ones I use) still display exactly the same ASCII text (or whatever encoding is used) that is in the source files. I would be interested to see how well it would work if editors displayed, for example (some examples are more adventurous than others):
Comments in a text box with a light-blue background and no // or /* ... */ visible
Javadoc comments could have semi-rich text editing support (for those who write HTML Javadoc comments) - seriously, I would appreciate it if code editors rendered Javadoc comments as HTML, because they're not the easiest to skim over as raw HTML in plain text
Functions in text boxes that could be collapsed to show only the signature (the collapsing can be done by current editors) and can be dragged around as boxes
Lines between function boxes to indicate how functions are connected
Zooming out so that rather than seeing a single source file (class in many languages) you can see multiple files and the way they connect to each other (this would essentially be building UML-like diagramming directly into the code editor)
I think this (in my mind at least) would work without requiring additional markup in the source files so users of plain text editors wouldn't be disadvantaged by having all this extra markup cluttering the files.
Part of the problem might stem from the fact that when you don't write code, we don't call it programming: assembling modular components using a GUI, for instance.
You might be interested in these alternative programming "languages".
Ladder, designed to mimic the way relay logic schemes work. Horrible IMO, but easy to understand for the old guys who did logic with sticks and stones. http://www.amci.com/tutorials/images/ladder-diagram.gif
SFC, Sequential Function Chart, designed to simplify parallel programming. Code is written into boxes, and these boxes can be placed parallel to each other and will thus execute simultaneously. By connecting the ends of several boxes you can synchronize events. Very common for automation applications.
Mathematica!!! It might not be the best programming language, but the syntax highlighting (if you can call it that) is awesome! For example, you can input a matrix and see it nicely aligned instead of a huge double[][]. Graphs can be inserted in the code, and the formatting of mathematical expressions looks like it does when you write on paper. No more parenthesis madness or long Math.PI expressions that really only need one character. And best of all, the files are just plain text, even if they are rendered nicely in the editor!
Debugging is also an area where a lot of improvement has been done. Debuggers with replay are starting to appear, as well as visual debuggers where data can be modified in real time. Edit-and-continue is also a feature I wouldn't want to live without.
WTF "new users can only post a maximum of one hyperlink", you will have to google the stuff i originally added to this post >:(
A brain-to-computer translator. Typing is the real bottleneck. It really just needs to derive the algorithms I think up and convert that to machine code.
I would say a lot of the newer languages are pretty great at quickly creating algorithms. The improvements aren't so much revolutionary now as they are evolutionary.
Dare I say it might actually take a new development language (perhaps even a new paradigm) to take us through such a revolution.
I think you might want to take a look at Leo. This is one guy's attempt at answering what you're asking about. I still can't wrap my VIM head around it personally, but others take to it quickly. It's not just a programming IDE, but more of an information organizer. It's written in Python, but I don't see why you can't code in other languages with it. The power of Leo is not so much the language, but the ability to express your thoughts and organize them, whether in code, diagrams, images, or outlines. Look over the tutorial and examples to get a feel for it. You might like it.
Automated semantic source code transformations, where a program can be reliably examined and manipulated by using an abstract interface/frontend to it that is aware of the underlying semantics.
So that source code can be queried and dealt with pretty much like a SQL database.
Allowing you to do static analysis of source code and refactor even complex source code by doing something along the lines of:
FIND CALLERS OF FUNCTION "foo" WHERE SIGNATURE("int","int","char*") AND RETURN_TYPE("bool");
...
RENAME MACRO "max" TO "maximum" IN FILE "macros.hxx";
RENAME NAMESPACE "prj" TO "project";
RENAME SYMBOL "OLDFOO" IN NAMESPACE "project";
RENAME FUNCTION "log" TO "show_log";
RENAME CLASS "FOO" TO "OLDFOO";
RENAME METHOD "FOO::inc" TO "FOO::increment";
...
CHANGE SIGNATURE IN FUNCTION "foo" WHERE SIGNATURE("int","int") TO SIGNATURE("double","double");
CHANGE SIGNATURE IN METHOD "myClass::handle" WHERE SIGNATURE("char") TO SIGNATURE("unsigned char")
MOVE FUNCTION "foo" in FILE "stuff.cc" TO "foo_funcs.cc";

Tools to help reverse engineer binary file formats

What tools are available to aid in decoding unknown binary data formats?
I know Hex Workshop and 010 Editor both support structures. These are okay to a limited extent for a known fixed format but get difficult to use with anything more complicated, especially for unknown formats. I guess I'm looking at a module for a scripting language or a scriptable GUI tool.
For example, I'd like to be able to find a structure within a block of data from limited known information, perhaps a magic number. Once I've found a structure, then follow known length and offset words to find other structures. Then repeat this recursively and iteratively where it makes sense.
In my dreams, perhaps even automatically identify possible offsets and lengths based on what I've already told the system!
Here are some tips that come to mind:
From my experience, interactive scripting languages (I use Python) can be a great help. You can write a simple framework to deal with binary streams and some simple algorithms. Then you can write scripts that will take your binary and check various things. For example:
Do some statistical analysis on various parts. Random data, for example, will tell you that a part is probably compressed/encrypted. Zeros may mean padding between parts. Scattered zeros may mean integer values or Unicode strings, and so on. Try to spot various offsets. Try to convert parts of the binary into 2- or 4-byte integers or into floats, print them and see if they make sense. Write some functions that will search for repeating or very similar parts in the data; this way you can easily spot headers.
Try to find as many strings as possible, try different encodings (c strings, pascal strings, utf8/16, etc.). There are some good tools for that (I think that Hex Workshop has such a tool). Strings can tell you a lot.
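As a sketch of the statistical checks mentioned above, here is a minimal Java entropy scan; a result near 8 bits per byte suggests compression or encryption, while near 0 suggests padding:

import java.nio.file.Files;
import java.nio.file.Paths;

public class Entropy {
    public static void main(String[] args) throws Exception {
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        int[] counts = new int[256];
        for (byte b : data) counts[b & 0xFF]++;
        double entropy = 0.0;
        for (int c : counts) {
            if (c == 0) continue;
            double p = (double) c / data.length;
            entropy -= p * (Math.log(p) / Math.log(2)); // Shannon entropy in bits
        }
        System.out.printf("%.2f bits per byte%n", entropy);
    }
}

Run it over whole files first, then over fixed-size windows to see where the random-looking regions start and stop.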
Good luck!
For Mac OS X, there's a great tool that's even better than my iBored: Synalyze It!
(http://www.synalysis.net/)
Compared to iBored, it is better suited for non-blocked files, while also giving full control over structures, including scriptability (with Lua). And it visualizes structures better, too.
Tupni; to my knowledge not directly available out of Microsoft Research, but there is a paper about this tool which may be of interest to someone wanting to write a similar program (perhaps open source):
Tupni: Automatic Reverse Engineering of Input Formats (ACM digital library)
Abstract
Recent work has established the importance of automatic reverse engineering of protocol or file format specifications. However, the formats reverse engineered by previous tools have missed important information that is critical for security applications. In this paper, we present Tupni, a tool that can reverse engineer an input format with a rich set of information, including record sequences, record types, and input constraints. Tupni can generalize the format specification over multiple inputs. We have implemented a prototype of Tupni and evaluated it on 10 different formats: five file formats (WMF, BMP, JPG, PNG and TIF) and five network protocols (DNS, RPC, TFTP, HTTP and FTP). Tupni identified all record sequences in the test inputs. We also show that, by aggregating over multiple WMF files, Tupni can derive a more complete format specification for WMF. Furthermore, we demonstrate the utility of Tupni by using the rich information it provides for zero-day vulnerability signature generation, which was not possible with previous reverse engineering tools.
My own tool "iBored", which I released just recently, can do parts of this. I wrote the tool to visualize and debug file system formats (UDF, HFS, ISO9660, FAT etc.), and implemented search, copy and later even structure and templates support. The structure support is pretty straight-forward, and the templates are a way to identify structures dynamically.
The entire thing is programmable in a Visual BASIC dialect, allowing you to test values, read specific blocks, and so on.
The tool is free and works on all platforms (Win, Mac, Linux), but as it's a personal tool which I just released to the public to share, it's not much documented.
However, if you want to give it a try, and like to give feedback, I might add more useful features.
I'd even open source it, but as it's written in REALbasic, I doubt many people will join such a project.
Link: iBored home page
I still occasionally use an old hex editor called A.X.E., Advanced Hex Editor. It seems to have largely disappeared from the Internet now, though Google should still be able to find it for you. The last version I know of was version 3.4, but I've really only used the free-for-personal-use version 2.1.
Its most interesting feature, and the one I've had the most use for deciphering various game and graphics formats, is its graphical view mode. That basically just shows you the file with each byte turned into a color-coded pixel. And as simple as that sounds, it has made my reverse-engineering attempts a lot easier at times.
I suppose doing it by eye is quite the opposite of doing automatic analysis, though, and the graphical mode won't be much use for finding and following offsets...
The later version has some features that sound like they could fit your needs (scripts, regularity finder, grammar generator), but I have no idea how good they are.
There is Hachoir, which is a Python library for parsing any binary format into fields that you can then browse. It has lots of parsers for common formats, but you can also write your own parsers for your files (e.g. when working with code that reads or writes binary files, I usually write a Hachoir parser first to have a debugging aid). It looks like the project is pretty much inactive by now, though.
Kaitai is an open-source language for describing binary structures in data streams. It comes with a translator that can output parsing code for many programming languages, for inclusion in your own program code.
My project icebuddha.com supports this, using Python to describe the format in the browser.
A cut'n'paste of my answer to a similar question:
One tool is WinOLS, which is designed for interpreting and editing vehicle engine management computer binary images (mostly the numeric data in their lookup tables). It has support for various endian formats (though not PDP, I think) and viewing data at various widths and offsets, defining array areas (maps) and visualising them in 2D or 3D with all kinds of scaling and offset options. It also has a heuristic/statistical automatic map finder, which might work for you.
It's a commercial tool, but the free demo will let you do everything but save changes to the binary and use engine management features you don't need.

Are you fluent in Unicode yet?

Almost 5 years ago Joel Spolsky wrote this article, "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".
Like many, I read it carefully, realizing it was high-time I got to grips with this "replacement for ASCII". Unfortunately, 5 years later I feel I have slipped back into a few bad habits in this area. Have you?
I don't write many specifically international applications, however I have helped build many ASP.NET internet facing websites, so I guess that's not an excuse.
So for my benefit (and I believe many others) can I get some input from people on the following:
How to "get over" ASCII once and for all
Fundamental guidance when working with Unicode.
Recommended (recent) books and websites on Unicode (for developers).
Current state of Unicode (5 years after Joel's article)
Future directions.
I must admit I have a .NET background and so would also be happy for information on Unicode in the .NET framework. Of course this shouldn't stop anyone with a differing background from commenting though.
Update: See this related question also asked on StackOverflow previously.
Since I read the Joel article and some other i18n articles, I have always kept a close eye on my character encodings, and it actually works if you do it consistently. If you work in a company where it is standard to use UTF-8 and everybody knows this / does this, it will work.
Here some interesting articles (besides Joel's article) on the subject:
http://www.tbray.org/ongoing/When/200x/2003/04/06/Unicode
http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF
A quote from the first article; Tips for using Unicode:
Embrace Unicode, don't fight it; it's probably the right thing to do, and if it weren't you'd probably have to anyhow.
Inside your software, store text as UTF-8 or UTF-16; that is to say, pick one of the two and stick with it.
Interchange data with the outside world using XML whenever possible; this makes a whole bunch of potential problems go away.
Try to make your application browser-based rather than write your own client; the browsers are getting really quite good at dealing with the texts of the world.
If you're using someone else's library code (and of course you are), assume its Unicode handling is broken until proved to be correct.
If you're doing search, try to hand the linguistic and character-handling problems off to someone who understands them.
Go off to Amazon or somewhere and buy the latest revision of the printed Unicode standard; it contains pretty well everything you need to know.
Spend some time poking around the Unicode web site and learning how the code charts work.
If you're going to have to do any serious work with Asian languages, go buy the O'Reilly book on the subject by Ken Lunde.
If you have a Macintosh, run out and grab Lord Pixel's Unicode Font Inspection tool. Totally cool.
If you're really going to have to get down and dirty with the data, go attend one of the twice-a-year Unicode conferences. All the experts go and if you don't know what you need to know, you'll be able to find someone there who knows.
I spent a while working with search engine software - You wouldn't believe how many web sites serve up content with HTTP headers or meta tags which lie about the encoding of the pages. Often, you'll even get a document which contains both ISO-8859 characters and UTF-8 characters.
Once you've battled through a few of those sorts of issues, you start taking the proper character encoding of data you produce really seriously.
The .NET Framework uses Windows default encoding for storing strings, which turns out to be UTF-16. If you don't specify an encoding when you use most text I/O classes, you will write UTF-8 with no BOM and read by first checking for a BOM then assuming UTF-8 (I know for sure StreamReader and StreamWriter behave this way.) This is pretty safe for "dumb" text editors that won't understand a BOM but kind of cruddy for smarter ones that could display UTF-8 or the situation where you're actually writing characters outside the standard ASCII range.
Normally this is invisible, but it can rear its head in interesting ways. Yesterday I was working with someone who was using XML serialization to serialize an object to a string using a StringWriter, and he couldn't figure out why the encoding was always UTF-16. Since a string in memory is going to be UTF-16 and that is enforced by .NET, that's the only thing the XML serialization framework could do.
So, when I'm writing something that isn't just a throwaway tool, I specify a UTF-8 encoding with a BOM. Technically in .NET you will always be accidentally Unicode aware, but only if your user knows to detect your encoding as UTF-8.
It makes me cry a little every time I see someone ask, "How do I get the bytes of a string?" and the suggested solution uses Encoding.ASCII.GetBytes() :(
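The same trap exists in Java with String.getBytes(); a sketch of the fix (in .NET the equivalent is Encoding.UTF8.GetBytes):

import java.nio.charset.StandardCharsets;

byte[] bad  = "naïve".getBytes();                        // platform default charset - varies by machine
byte[] good = "naïve".getBytes(StandardCharsets.UTF_8);  // explicit and lossless
String back = new String(good, StandardCharsets.UTF_8);  // round-trips to "naïve"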
Rule of thumb: if you never munge or look inside a string and instead treat it strictly as a blob of data, you'll be much better off.
Even doing something as simple as splitting words or lowercasing strings becomes tough if you want to do it "the Unicode way".
And if you want to do it "the Unicode way", you'll need an awfully good library. This stuff is incredibly complex.