How prevalent is UTF-8 really? - language-agnostic

How wide-spread is the use of UTF-8 for non-English text, on the WWW or otherwise? I'm interested both in statistical data and the situation in specific countries.
I know that ISO-8859-1 (or 15) is firmly entrenched in Germany - but what about languages where you have to use multibyte encodings anyway, like Japan or China? I know that a few years ago, Japan was still using the various JIS encodings almost exclusively.
Given these observations, would it even be true that UTF-8 is the most common multibyte encoding? Or would it be more correct to say that it's basically only used internally in new applications that specifically target an international market and/or have to work with multi-language texts? Is it acceptable nowadays to have an app that ONLY uses UTF-8 in its output, or would each national market expect output files to be in a different legacy encoding in order to be usable by other apps.
Edit:
I am NOT asking whether or why UTF-8 is useful or how it works. I know all that. I am asking whether it is actually being adopted widely and replacing older encodings.

We use UTF-8 in our service-oriented web-service world almost exclusively - even with "just" Western European languages, there are a enough "quirks" to using various ISO-8859-X formats to make our heads spin - UTF-8 really just totally solves that.
So I'd put in a BIG vote for use of UTF-8 everywhere and all the time ! :-) I guess in a service-oriented world and in .NET and Java environments, that's really not an issue or a potential problem anymore.
It just solves so many problems that you really don't need to have to deal with all the time......
Marc

As of 11 April 2021 UTF-8 is used on 96.7% of websites.

I don't think it's acceptable to just accept UTF-8 - you need to be accepting UTF-8 and whatever encoding was previously prevalent in your target markets.
The good news is, if you're coming from a German situation, where you mostly have 8859-1/15 and ASCII, additionally accepting 8859-1 and converting it into UTF-8 is basically zero-cost. It's easy to detect: using 8859-1-encoded ö or ü is invalid UTF-8, for example, without even going into the easily-detectable invalid pairs. Using characters 128-159 is unlikely to be valid 8859-1. Within a few bytes of your first high byte, you can generally have a very, very good idea of which encoding is in use. And once you know the encoding, whether by specification or guessing, you don't need a translation table to convert 8859-1 to Unicode - U+0080 through to U+00FF are exactly the same as the 0x80-0xFF in 8859-1.

Is it acceptable nowadays to have an
app that ONLY uses UTF-8 in its
output, or would each national market
expect output files to be in a
different legacy encoding in order to
be usable by other apps.
Hmm, depends on what kind of apps and output we're talking about... In many cases (e.g. most web-based stuff) you can certainly go with UTF-8 only, but, for example, in a desktop application that allows user to save some data in plain text files, I think UTF-8 only is not enough.
Mac OS X uses UTF-8 extensively, and it's the default encoding for users' files, and this is the case in most (all?) major Linux distributions too. But on Windows... is Windows-1252 (close but not same as ISO-8859-1) still the default encoding for many languages? At least in Windows XP it was, but I'm not sure if this has changed? In any case, so long as significant number of (mostly Windows) users have the files on their computers encoded in Windows-1252 (or something close to that), supporting UTF-8 only would cause grief and confusion for many.
Some country specific info: in Finland ISO-8859-1 (or 15) is likewise still firmly entrenched. As an example, Finnish IRC channels use, afaik, still mostly Latin-1. (Which means Linux guys with UTF-8 as system default using text-based clients (e.g. irssi) need to do some workarounds / tweak settings.)

I tend to visit Runet websites quite often. Many of them still use Windows-1251 encoding. Also it's the default encoding in Yandex Mail and Mail.ru (two largest webmail services in CIS countries). It's also set as a the default content encoding in Opera browser (2nd after Firefox by popularity in the region) when one downloads it from Russian ip address. I'm not quite sure about other browsers though.
The reason for that is quite simple: UTF-8 requires two bytes to encode Cyrillic letters. Non-unicode encodings require 1 byte only (unlike most Eastern alphabets Cyrillic ones are quite small). They are also fixed-length and easily processable by old ASCII-only tools.

Here are some statistics I was able to find:
This page shows usage statistics for character encodings in "top websites".
This page is another example.
Both of these pages seem to suffer from significant problems:
It is not clear how representative their sample sets are, particularly for non English speaking countries.
It is not clear what methodologies were used to gather the statistics. Are they counting pages or counts of page accesses? What about downloadable / downloaded content.
More important, the statistics are only for web-accessible content. Broader statistics (e.g. for encoding of documents on user's hard drives) do not seem to be obtainable. (This does not surprise me, given how difficult / costly it would be to do the studies needed across many countries.)
In short, your question is not objectively answerable. You might be able to find studies somewhere about how "acceptable" a UTF-8 only application might be in specific countries, but I was not able to find any.
For me, the take away is that it is a good idea to write your applications to be character encoding agnostic, and let the user decide which character encoding to use for storing documents. This is relatively easy to do in modern languages like Java and C#.

Users of CJK characters are biassed against UTF-8 naturally because their characters become 3 bytes each instead of two. Evidently, in China the preference is for their own 2-byte GBK encoding, not UTF-16.
Edit in response to this comment by #Joshua :
And it turns out for most web work the pages would be smaller in UTF-8 anyway as the HTML and javascript characters now encode to one byte.
Response:
The GB.+ encodings and other East Asian encodings are variable length encodings. Bytes with values up to 0x7F are mapped mostly to ASCII (with sometimes minor variations). Some bytes with the high bit set are lead bytes of sequences of 2 to 4 bytes, and others are illegal. Just like UTF-8.
As "HTML and javascript characters" are also ASCII characters, they have ALWAYS been 1 byte, both in those encodings and in UTF-8.

UTF-8 is popular because it is usually more compact than UTF-16, with full fidelity. It also doesn't suffer from the endianness issue of UTF-16.
This makes it a great choice as an interchange format, but because characters encode to varying byte runs (from one to four bytes per character) it isn't always very nice to work with. So it is usually cleaner to reserve UTF-8 for data interchange, and use conversion at the points of entry and exit.
For system-internal storage (including disk files and databases) it is probably cleaner to use a native UTF-16, UTF-16 with some other compression, or some 8-bit "ANSI" encoding. The latter of course limits you to a particular codepage and you can suffer if you're handling multi-lingual text. For processing the data locally you'll probably want some "ANSI" encoding or native UTF-16. Character handling becomes a much simpler problem that way.
So I'd suggest that UTF-8 is popular externally, but rarer internally. Internally UTF-8 seems like a nightmare to work with aside from static text blobs.
Some DBMSs seem to choose to store text blobs as UTF-8 all the time. This offers the advantage of compression (over storing UTF-16) without trying to devise another compression scheme. Because conversion to/from UTF-8 is so common they probably make use of system libraries that are known to work efficiently and reliably.
The biggest problems with "ANSI" schemes are being bound to a single small character set and needing to handle multibyte character set sequences for languages with large alphabets.

While it does not specifically address the question -- UTF-8 is the only character encoding mandatory to implement in all IETF track protocols.
http://www.ietf.org/rfc/rfc2277.txt

You might be interested in this question. I've been trying to build a CW about the support for unicode in various languages.

I'm interested both in statistical
data and the situation in specific
countries.
On W3Techs, we have all these data, but it's perhaps not easy to find:
For example, you get the character encoding distribution of Japanese websites by first selecting the language: Content Languages > Japanese, and then you select Segmentation > Character Encodings. That brings you to this report: Distribution of character encodings among websites that use Japanese. You see: Japanese sites use 49% SHIFT-JIS and 38% UTF-8. You can do the same per top level domain, say all .jp sites.

Both Java and C# use UTF-16 internally and can easily translate to other encodings; they're pretty well entrenched in the enterprise world.
I'd say accepting only UTF as input is not that big a deal these days; go for it.

I'm interested both in statistical
data and the situation in specific
countries.
I think this is much more dependent on the problem domain and its history, then on the country in which an application is used.
If you're building an application for which all your competitors are outputting in e.g. ISO-8859-1 (or have been for the majority of the last 10 years), I think all your (potential) clients would expect you to open such files without much hassle.
That said, I don't think most of the time there's still a need to output anything but UTF-8 encoded files. Most programs cope these days, but once again, YMMV depending on your target market.

Related

Pro's and Con's of using HTML Codes vs Special Characters

When building websites for non-english speaking countries
you have tons of characters that are out of the scope.
For the database I usally encode it on either utf-8 or latin-1.
I would like to know if there is any issue with performance, speed resolution, space optimization, etc.
For the fixed texts that are on the html between using for example
á or á
which looks exactly the same: á or á
The things that I have so far for using it with utf-8:
Pros:
Easy to read for the developers and the web administrator
Only one space ocupied on the code instead of 4-5
Easier to extract an excerpt from a text
1 byte against 8 bytes (according to my testings)
Cons:
When sending files to other developers depending on the ide, softwares, etc that they use to read the code they will break the accent in things like: é
When an auto minification of code occurs it sometimes break it too
Usually breaks when is inside an encoding
The two cons that I have a bigger weight than the pros by my perspective because the reflect on the visitor.
Just use the actual character á.
This is for many reasons.
First: a separation of concerns, the database shouldn't know about HTML. Just imagine if at a later date you want to create an API to use it in another service or a Mobile App.
Second: just use UTF-8 for your database not latin. Again, think ahead what if your app suddently needs to support Japanese then how you store あ?
You always have the change to convert it to HTML codes if you really have to... in a view. HTML is an implementation detail, not core to your app.
If your concern is the user, all major browsers in this time and age support UTF-8. Just use the right meta tag. Easy.
If your problem are developers and their tools take a look at http://editorconfig.org/ to enforce and automatize line endings and the usage of UTF-8 in your files.
Maybe add some git attributes to the mix and why not go the extra mile and have a git precommit hook running some checker so make super sure everyone commits UTF-8 files.
Computer time is cheap, developer time is expensive: á is easier to change and understand, just use it.

Are these characters valid in practice in a url in href in HTML5 document? :,;&

We are planning to have urls like
http://www.nowhere.nu/path/file;key:value,key2:value2.html
This document says the special characters must be escaped, which kind of defeats the purpose of having "nice" urls:
http://www.blooberry.com/indexdot/html/topics/urlencoding.htm
However, testing the major browsers, it seems unnecessary to urlencode ;: and , in this position
Can you provide a popular browser where it doesn't work, or an argument against such a scheme?
Similar question for & (invalid but works) versus & amp; (valid).
As mentioned in the comments the information referenced is outdated. Any questionable characters should be escaped for safety.
With URLs you have 2 target audiences, humans and machines. Most humans cannot efficiently process URL paths that contain anything but alphanumeric characters, .s, and /s. They also degrade in efficiency the further along you go past the host name. So essentially once you get to a character that you would have to encode, you've already lost being human friendly (and many don't even know where the location bar is let alone pay any attention to it).
Machines on the other hand don't have a problem with escaped characters. They may have a problem with un-escaped characters that they don't recognize. As HTTP is incredibly simple there are tons of clients available including little custom clients for doing things like mashups. So even if you cover the major browsers being able to plow through your unescaped characters, you may break some unforeseen client. Overall you're left with no practical benefit but higher risk.

Is UTF-8 enough for all common languages?

I just wanted to develop a translation app in a Django projects which enables registered users with certain permissions to translate every single message it appears in latest version.
My question is, what character set should I use for database tables in this translation app? Looks like some european language characters cannot be stored in UTF-8?
Looks like some european language characters cannot be stored in UTF-8?
Not true. UTF-8 can store any character set without limitations except maybe for Klingon. UTF-8 is your one stop shop for internationalization. If you have problems with characters, they are most likely to be encoding problems, or missing support for that character range in the font you're using to display the data with (Extremely unlikely for a european language character though, but common e.g. when viewing indian sites on an european computer. See also this question)
If a non-western character set can't be rendered, it could be that the user's built in font does not have that range of UTF-8 covered.
Update: Klingon it is indeed not part of official UTF-8:
Some modern invented scripts which have not yet been included in Unicode (e.g., Tengwar) or which do not qualify for inclusion in Unicode due to lack of real-world use (e.g., Klingon) are listed in the ConScript Unicode Registry, along with unofficial but widely-used Private Use Area code assignments.
However, there is a volunteer project that has inofficially assigned code points F8D0-F8FF in the private area to Klingon. Gallery of Klingon characters
UTF-8 can be used to represent all of Unicode, so it doesn't let you express all common languages. It allows you to express all languages.
If it seems as if some european characters aren't working, that's an encoding issue.

Are you fluent in Unicode yet?

Almost 5 years ago Joel Spolsky wrote this article, "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".
Like many, I read it carefully, realizing it was high-time I got to grips with this "replacement for ASCII". Unfortunately, 5 years later I feel I have slipped back into a few bad habits in this area. Have you?
I don't write many specifically international applications, however I have helped build many ASP.NET internet facing websites, so I guess that's not an excuse.
So for my benefit (and I believe many others) can I get some input from people on the following:
How to "get over" ASCII once and for all
Fundamental guidance when working with Unicode.
Recommended (recent) books and websites on Unicode (for developers).
Current state of Unicode (5 years after Joels' article)
Future directions.
I must admit I have a .NET background and so would also be happy for information on Unicode in the .NET framework. Of course this shouldn't stop anyone with a differing background from commenting though.
Update: See this related question also asked on StackOverflow previously.
Since I read the Joel article and some other I18n articles I always kept a close eye to my character encoding; And it actually works if you do it consistantly. If you work in a company where it is standard to use UTF-8 and everybody knows this / does this it will work.
Here some interesting articles (besides Joel's article) on the subject:
http://www.tbray.org/ongoing/When/200x/2003/04/06/Unicode
http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF
A quote from the first article; Tips for using Unicode:
Embrace Unicode, don't fight it; it's probably the right thing to do, and if it weren't you'd probably have to anyhow.
Inside your software, store text as UTF-8 or UTF-16; that is to say, pick one of the two and stick with it.
Interchange data with the outside world using XML whenever possible; this makes a whole bunch of potential problems go away.
Try to make your application browser-based rather than write your own client; the browsers are getting really quite good at dealing with the texts of the world.
If you're using someone else's library code (and of course you are), assume its Unicode handling is broken until proved to be correct.
If you're doing search, try to hand the linguistic and character-handling problems off to someone who understands them.
Go off to Amazon or somewhere and buy the latest revision of the printed Unicode standard; it contains pretty well everything you need to know.
Spend some time poking around the Unicode web site and learning how the code charts work.
If you're going to have to do any serious work with Asian languages, go buy the O'Reilly book on the subject by Ken Lunde.
If you have a Macintosh, run out and grab Lord Pixel's Unicode Font Inspection tool. Totally cool.
If you're really going to have to get down and dirty with the data, go attend one of the twice-a-year Unicode conferences. All the experts go and if you don't know what you need to know, you'll be able to find someone there who knows.
I spent a while working with search engine software - You wouldn't believe how many web sites serve up content with HTTP headers or meta tags which lie about the encoding of the pages. Often, you'll even get a document which contains both ISO-8859 characters and UTF-8 characters.
Once you've battled through a few of those sorts of issues, you start taking the proper character encoding of data you produce really seriously.
The .NET Framework uses Windows default encoding for storing strings, which turns out to be UTF-16. If you don't specify an encoding when you use most text I/O classes, you will write UTF-8 with no BOM and read by first checking for a BOM then assuming UTF-8 (I know for sure StreamReader and StreamWriter behave this way.) This is pretty safe for "dumb" text editors that won't understand a BOM but kind of cruddy for smarter ones that could display UTF-8 or the situation where you're actually writing characters outside the standard ASCII range.
Normally this is invisible, but it can rear its head in interesting ways. Yesterday I was working with someone who was using XML serialization to serialize an object to a string using a StringWriter, and he couldn't figure out why the encoding was always UTF-16. Since a string in memory is going to be UTF-16 and that is enforced by .NET, that's the only thing the XML serialization framework could do.
So, when I'm writing something that isn't just a throwaway tool, I specify a UTF-8 encoding with a BOM. Technically in .NET you will always be accidentally Unicode aware, but only if your user knows to detect your encoding as UTF-8.
It makes me cry a little every time I see someone ask, "How do I get the bytes of a string?" and the suggested solution uses Encoding.ASCII.GetBytes() :(
Rule of thumb: if you never munge or look inside a string and instead treat it strictly as a blob of data, you'll be much better off.
Even doing something as simple as splitting words or lowercasing strings becomes tough if you want to do it "the Unicode way".
And if you want to do it "the Unicode way", you'll need an awfully good library. This stuff is incredibly complex.

Internationalization in your projects

How have you implemented Internationalization (i18n) in actual projects you've worked on?
I took an interest in making software cross-cultural after I read the famous post by Joel, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). However, I have yet to able to take advantage of this in a real project, besides making sure I used Unicode strings where possible. But making all your strings Unicode and ensuring you understand what encoding everything you work with is in is just the tip of the i18n iceberg.
Everything I have worked on to date has been for use by a controlled set of US English speaking people, or i18n just wasn't something we had time to work on before pushing the project live. So I am looking for any tips or war stories people have about making software more localized in real world projects.
It has been a while, so this is not comprehensive.
Character Sets
Unicode is great, but you can't get away with ignoring other character sets. The default character set on Windows XP (English) is Cp1252. On the web, you don't know what a browser will send you (though hopefully your container will handle most of this). And don't be surprised when there are bugs in whatever implementation you are using. Character sets can have interesting interactions with filenames when they move to between machines.
Translating Strings
Translators are, generally speaking, not coders. If you send a source file to a translator, they will break it. Strings should be extracted to resource files (e.g. properties files in Java or resource DLLs in Visual C++). Translators should be given files that are difficult to break and tools that don't let them break them.
Translators do not know where strings come from in a product. It is difficult to translate a string without context. If you do not provide guidance, the quality of the translation will suffer.
While on the subject of context, you may see the same string "foo" crop up in multiple times and think it would be more efficient to have all instances in the UI point to the same resource. This is a bad idea. Words may be very context-sensitive in some languages.
Translating strings costs money. If you release a new version of a product, it makes sense to recover the old versions. Have tools to recover strings from your old resource files.
String concatenation and manual manipulation of strings should be minimized. Use the format functions where applicable.
Translators need to be able to modify hotkeys. Ctrl+P is print in English; the Germans use Ctrl+D.
If you have a translation process that requires someone to manually cut and paste strings at any time, you are asking for trouble.
Dates, Times, Calendars, Currency, Number Formats, Time Zones
These can all vary from country to country. A comma may be used to denote decimal places. Times may be in 24hour notation. Not everyone uses the Gregorian calendar. You need to be unambiguous, too. If you take care to display dates as MM/DD/YYYY for the USA and DD/MM/YYYY for the UK on your website, the dates are ambiguous unless the user knows you've done it.
Especially Currency
The Locale functions provided in the class libraries will give you the local currency symbol, but you can't just stick a pound (sterling) or euro symbol in front of a value that gives a price in dollars.
User Interfaces
Layout should be dynamic. Not only are strings likely to double in length on translation, the entire UI may need to be inverted (Hebrew; Arabic) so that the controls run from right to left. And that is before we get to Asia.
Testing Prior To Translation
Use static analysis of your code to locate problems. At a bare minimum, leverage the tools built into your IDE. (Eclipse users can go to Window > Preferences > Java > Compiler > Errors/Warnings and check for non-externalised strings.)
Smoke test by simulating translation. It isn't difficult to parse a resource file and replace strings with a pseudo-translated version that doubles the length and inserts funky characters. You don't have to speak a language to use a foreign operating system. Modern systems should let you log in as a foreign user with translated strings and foreign locale. If you are familiar with your OS, you can figure out what does what without knowing a single word of the language.
Keyboard maps and character set references are very useful.
Virtualisation would be very useful here.
Non-technical Issues
Sometimes you have to be sensitive to cultural differences (offence or incomprehension may result). A mistake you often see is the use of flags as a visual cue choosing a website language or geography. Unless you want your software to declare sides in global politics, this is a bad idea. If you were French and offered the option for English with St. George's flag (the flag of England is a red cross on a white field), this might result in confusion for many English speakers - assume similar issues will arise with foreign languages and countries. Icons need to be vetted for cultural relevance. What does a thumbs-up or a green tick mean? Language should be relatively neutral - addressing users in a particular manner may be acceptable in one region, but considered rude in another.
Resources
C++ and Java programmers may find the ICU website useful: http://www.icu-project.org/
Some fun things:
Having a PHP and MySQL Application that works well with German and French, but now needs to support Russian and Chinese. I think I move this over to .net, as PHP's Unicode support is - in my opinion - not really good. Sure, juggling around with utf8_de/encode or the mbstring-functions is fun. Almost as fun as having Freddy Krüger visit you at night...
Realizing that some languages are a LOT more Verbose than others. German is a LOT more verbose than English usually, and seeing how the German Version destroys the User Interface because too little space was allocated was not fun. Some products gained some fame for their creative ways to work around that, with Oblivion's "Schw.Tr.d.Le.En.W." being memorable :-)
Playing around with date formats, woohoo! Yes, there ARE actually people in the world who use date formats where the day goes in the middle. Sooooo much fun trying to find out what 07/02/2008 is supposed to mean, just because some users might believe it could be July 2... But then again, you guys over the pond may believe the same about users who put the month in the middle :-P, especially because in English, July 2 sounds a lot better than 2nd of July, something that does not neccessarily apply to other languages (i.e. in German, you would never say Juli 2 but always Zweiter Juli). I use 2008-02-07 whenever possible. It's clear that it means February 7 and it sorts properly, but dd/mm vs. mm/dd can be a really tricky problem.
Anoter fun thing, Number formats! 10.000,50 vs 10,000.50 vs. 10 000,50 vs. 10'000,50... This is my biggest nightmare right now, having to support a multi-cultural environent but not having any way to reliably know what number format the user will use.
Formal or Informal. In some language, there are two ways to address people, a formal way and a more informal way. In English, you just say "You", but in German you have to decide between the formal "Sie" and the informal "Du", same for French Tu/Vous. It's usually a safe bet to choose the formal way, but this is easily overlooked.
Calendars. In Europe, the first day of the Week is Monday, whereas in the US it's Sunday. Calendar Widgets are nice. Showing a Calendar with Sunday on the left and Saturday on the right to a European user is not so nice, it confuses them.
I worked on a project for my previous employer that used .NET, and there was a built in .resx format we used. We basically had a file that had all translations in the .resx file, and then multiple files with different translations. The consequence of this is that you have to be very diligent about ensuring that all strings visible in the application are stored in the .resx, and anytime one is changed you have to update all languages you support.
If you get lazy and don't notify the people in charge of translations, or you embed strings without going through your localization system, it will be a nightmare to try and fix it later. Similarly, if localization is an afterthought, it will be very difficult to put in place. Bottom line, if you don't have all visible strings stored externally in a standard place, it will be very difficult to find all that need to be localized.
One other note, very strictly avoid concatenating visible strings directly, such as
String message = "The " + item + " is on sale!";
Instead, you must use something like
String message = String.Format("The {0} is on sale!", item);
The reason for this is that different languages often order the words differently, and concatenating strings directly will need a new build to fix, but if you used some kind of string replacement mechanism like above, you can modify your .resx file (or whatever localization files you use) for the specific language that needs to reorder the words.
I was just listening to a Podcast from Scott Hanselman this morning, where he talks about internationalization, especially the really tricky things, like Turkish (with it's four i's) and Thai. Also, Jeff Atwood had a post:
Besides all the previous tips, remember that i18n it's not just about changing words for their equivalent on other languages, especially for non-latin languages alphabets (korean, Arabic) which written right to left, so the whole UI will have to conform, like
item 1
item 2
item 3
would have to be
arabic text 1 -
arabic text 2 -
arabic text 3 -
(reversed bullet list doesn't seem to work :P)
which can be a UI nightmare if your system has to apply changes dinamically once the user changes the language being used.
Another very hard thing is to test different languages, not just for the correctness of word, but since languages like Korean usually have bigger font type for their characters this may lead to language specific bugs (like "SAVE" text on a button being larger than the button itself for some language).
One of the funnier things to discover: italics and bold text makrup does not work with CJK (Chinese/Japanese/Korean) characters. They simply become unreadable. (OK, I couldn't really read them before either, but especially bolding just creates ink blots)
I think everyone working in internationalization should be familiar with the Common Locale Data Repository, which is now a sub-project of Unicode:
Common Locale Data Repository
Those folks are working hard to establish a standard resource for all kinds of i18n issues: currency, geographical names, tons of stuff. Any project that's maintaining its own core local data given that this project exists is pretty bonkers, IMHO.
I suggest to use something like 99translations.com to maintain your translations . Otherwise you won't be able to tell what of your translations are up to date in every language.
Another challenge will be accepting input from your users. In many cases, this is eased by the input processing provided by the operating system, such as IME in Windows, which works transparently with common text widgets, but this facility will not be available for every possible need.
One website I use has a translation method the owner calls "wiki + machine translation". This is a community based site so is obviously different to the needs of companies.
http://blog.bookmooch.com/2007/09/23/how-bookmooch-does-its-translations/
One thing no one have mentioned yet is strings with some warying part as in "The unit will arive in 5 days" or "On Monday something happens." where 5 and Monday will change depending on state. It is not a good idea to split those in two and concatenate them. With only one varying part and good documentation you might get away with it, with two varying parts there will be some language that preferes to change the order of them.