Apache Tika / TESSERACT issue - ocr

I have a question concerning Apache TIKA / TESSERACT.
I need to OCR some poor quality documents which contain different alphabets e.g. german/polish/english.
I did OCR with all of the alphabets at first because I thought it would be faster (I mean: by using http header "X-Tika-OCRLanguage").
I noticed that some characters were misidentified so I thought that result will be better after reducing the number of alphabets to those that de facto appear in the document (I checked it manually).
But after the reduction of languages, there are characters replaced incorrectly with other characters even though the first identification was correct.
Example:
The correct spelling: DÖNER
1st try: http header "X-Tika-OCRLanguage" includes english, polish, german
the result: DÖNĘR
2nd try: http header "X-Tika-OCRLanguage" includes only english and german
the result: DONER
I don't know what is going on. Is there anyone who know how to explain this issue? Is there anything what I possibly could do to improve the outcome?

Related

How to do OCR on a single character

I am writing a program that should be able to detect a single character from the image of it.
I think it should be pretty easy given how powerful OCR software have become these days but I have no real idea how to do it.
Here are the specifics:
The language is Persian
The character is not hand written.
There are no words or sentences, the image is of a single character generated from a PDF file. It will look like this:
Now ideally I should be able to perform OCR on this image and determine the character.
But I was using another approach so far. The fonts used in the PDF files are from a finite set of fonts (100 something) and from those only 2-3 fonts are usually used. So I can actually "cheat", and compare this character to all the characters of these 100 fonts and determine what it is.
As an example these are some of the characters in the font "Roya". I intended to compare my character image with all of these and determine the letter. Repeat for every other font until a match is found.
I was doing a bitmap compare with imagemagick but I realized that even if the fonts are the same there are still small differences between the character images generated from the same font.
As an example, these two are both the character "beh" from the font "Zar". But as you can see there won't be an exact match when doing a bitmap compare between them:
So given all this how should I go about doing the OCR?
Other notes:
The program is written in Java, but a standalone application or a C/C++ library is also acceptable.
I tried using Tesseract but I just couldn't get it to detect characters. Persian was very badly documented and it looked like it would need a ton of calibration and training. It also looked like it is optimized for detecting words and gave very bad results when detecting single characters.

Pro's and Con's of using HTML Codes vs Special Characters

When building websites for non-english speaking countries
you have tons of characters that are out of the scope.
For the database I usally encode it on either utf-8 or latin-1.
I would like to know if there is any issue with performance, speed resolution, space optimization, etc.
For the fixed texts that are on the html between using for example
á or á
which looks exactly the same: á or á
The things that I have so far for using it with utf-8:
Pros:
Easy to read for the developers and the web administrator
Only one space ocupied on the code instead of 4-5
Easier to extract an excerpt from a text
1 byte against 8 bytes (according to my testings)
Cons:
When sending files to other developers depending on the ide, softwares, etc that they use to read the code they will break the accent in things like: é
When an auto minification of code occurs it sometimes break it too
Usually breaks when is inside an encoding
The two cons that I have a bigger weight than the pros by my perspective because the reflect on the visitor.
Just use the actual character á.
This is for many reasons.
First: a separation of concerns, the database shouldn't know about HTML. Just imagine if at a later date you want to create an API to use it in another service or a Mobile App.
Second: just use UTF-8 for your database not latin. Again, think ahead what if your app suddently needs to support Japanese then how you store あ?
You always have the change to convert it to HTML codes if you really have to... in a view. HTML is an implementation detail, not core to your app.
If your concern is the user, all major browsers in this time and age support UTF-8. Just use the right meta tag. Easy.
If your problem are developers and their tools take a look at http://editorconfig.org/ to enforce and automatize line endings and the usage of UTF-8 in your files.
Maybe add some git attributes to the mix and why not go the extra mile and have a git precommit hook running some checker so make super sure everyone commits UTF-8 files.
Computer time is cheap, developer time is expensive: á is easier to change and understand, just use it.

How prevalent is UTF-8 really?

How wide-spread is the use of UTF-8 for non-English text, on the WWW or otherwise? I'm interested both in statistical data and the situation in specific countries.
I know that ISO-8859-1 (or 15) is firmly entrenched in Germany - but what about languages where you have to use multibyte encodings anyway, like Japan or China? I know that a few years ago, Japan was still using the various JIS encodings almost exclusively.
Given these observations, would it even be true that UTF-8 is the most common multibyte encoding? Or would it be more correct to say that it's basically only used internally in new applications that specifically target an international market and/or have to work with multi-language texts? Is it acceptable nowadays to have an app that ONLY uses UTF-8 in its output, or would each national market expect output files to be in a different legacy encoding in order to be usable by other apps.
Edit:
I am NOT asking whether or why UTF-8 is useful or how it works. I know all that. I am asking whether it is actually being adopted widely and replacing older encodings.
We use UTF-8 in our service-oriented web-service world almost exclusively - even with "just" Western European languages, there are a enough "quirks" to using various ISO-8859-X formats to make our heads spin - UTF-8 really just totally solves that.
So I'd put in a BIG vote for use of UTF-8 everywhere and all the time ! :-) I guess in a service-oriented world and in .NET and Java environments, that's really not an issue or a potential problem anymore.
It just solves so many problems that you really don't need to have to deal with all the time......
Marc
As of 11 April 2021 UTF-8 is used on 96.7% of websites.
I don't think it's acceptable to just accept UTF-8 - you need to be accepting UTF-8 and whatever encoding was previously prevalent in your target markets.
The good news is, if you're coming from a German situation, where you mostly have 8859-1/15 and ASCII, additionally accepting 8859-1 and converting it into UTF-8 is basically zero-cost. It's easy to detect: using 8859-1-encoded ö or ü is invalid UTF-8, for example, without even going into the easily-detectable invalid pairs. Using characters 128-159 is unlikely to be valid 8859-1. Within a few bytes of your first high byte, you can generally have a very, very good idea of which encoding is in use. And once you know the encoding, whether by specification or guessing, you don't need a translation table to convert 8859-1 to Unicode - U+0080 through to U+00FF are exactly the same as the 0x80-0xFF in 8859-1.
Is it acceptable nowadays to have an
app that ONLY uses UTF-8 in its
output, or would each national market
expect output files to be in a
different legacy encoding in order to
be usable by other apps.
Hmm, depends on what kind of apps and output we're talking about... In many cases (e.g. most web-based stuff) you can certainly go with UTF-8 only, but, for example, in a desktop application that allows user to save some data in plain text files, I think UTF-8 only is not enough.
Mac OS X uses UTF-8 extensively, and it's the default encoding for users' files, and this is the case in most (all?) major Linux distributions too. But on Windows... is Windows-1252 (close but not same as ISO-8859-1) still the default encoding for many languages? At least in Windows XP it was, but I'm not sure if this has changed? In any case, so long as significant number of (mostly Windows) users have the files on their computers encoded in Windows-1252 (or something close to that), supporting UTF-8 only would cause grief and confusion for many.
Some country specific info: in Finland ISO-8859-1 (or 15) is likewise still firmly entrenched. As an example, Finnish IRC channels use, afaik, still mostly Latin-1. (Which means Linux guys with UTF-8 as system default using text-based clients (e.g. irssi) need to do some workarounds / tweak settings.)
I tend to visit Runet websites quite often. Many of them still use Windows-1251 encoding. Also it's the default encoding in Yandex Mail and Mail.ru (two largest webmail services in CIS countries). It's also set as a the default content encoding in Opera browser (2nd after Firefox by popularity in the region) when one downloads it from Russian ip address. I'm not quite sure about other browsers though.
The reason for that is quite simple: UTF-8 requires two bytes to encode Cyrillic letters. Non-unicode encodings require 1 byte only (unlike most Eastern alphabets Cyrillic ones are quite small). They are also fixed-length and easily processable by old ASCII-only tools.
Here are some statistics I was able to find:
This page shows usage statistics for character encodings in "top websites".
This page is another example.
Both of these pages seem to suffer from significant problems:
It is not clear how representative their sample sets are, particularly for non English speaking countries.
It is not clear what methodologies were used to gather the statistics. Are they counting pages or counts of page accesses? What about downloadable / downloaded content.
More important, the statistics are only for web-accessible content. Broader statistics (e.g. for encoding of documents on user's hard drives) do not seem to be obtainable. (This does not surprise me, given how difficult / costly it would be to do the studies needed across many countries.)
In short, your question is not objectively answerable. You might be able to find studies somewhere about how "acceptable" a UTF-8 only application might be in specific countries, but I was not able to find any.
For me, the take away is that it is a good idea to write your applications to be character encoding agnostic, and let the user decide which character encoding to use for storing documents. This is relatively easy to do in modern languages like Java and C#.
Users of CJK characters are biassed against UTF-8 naturally because their characters become 3 bytes each instead of two. Evidently, in China the preference is for their own 2-byte GBK encoding, not UTF-16.
Edit in response to this comment by #Joshua :
And it turns out for most web work the pages would be smaller in UTF-8 anyway as the HTML and javascript characters now encode to one byte.
Response:
The GB.+ encodings and other East Asian encodings are variable length encodings. Bytes with values up to 0x7F are mapped mostly to ASCII (with sometimes minor variations). Some bytes with the high bit set are lead bytes of sequences of 2 to 4 bytes, and others are illegal. Just like UTF-8.
As "HTML and javascript characters" are also ASCII characters, they have ALWAYS been 1 byte, both in those encodings and in UTF-8.
UTF-8 is popular because it is usually more compact than UTF-16, with full fidelity. It also doesn't suffer from the endianness issue of UTF-16.
This makes it a great choice as an interchange format, but because characters encode to varying byte runs (from one to four bytes per character) it isn't always very nice to work with. So it is usually cleaner to reserve UTF-8 for data interchange, and use conversion at the points of entry and exit.
For system-internal storage (including disk files and databases) it is probably cleaner to use a native UTF-16, UTF-16 with some other compression, or some 8-bit "ANSI" encoding. The latter of course limits you to a particular codepage and you can suffer if you're handling multi-lingual text. For processing the data locally you'll probably want some "ANSI" encoding or native UTF-16. Character handling becomes a much simpler problem that way.
So I'd suggest that UTF-8 is popular externally, but rarer internally. Internally UTF-8 seems like a nightmare to work with aside from static text blobs.
Some DBMSs seem to choose to store text blobs as UTF-8 all the time. This offers the advantage of compression (over storing UTF-16) without trying to devise another compression scheme. Because conversion to/from UTF-8 is so common they probably make use of system libraries that are known to work efficiently and reliably.
The biggest problems with "ANSI" schemes are being bound to a single small character set and needing to handle multibyte character set sequences for languages with large alphabets.
While it does not specifically address the question -- UTF-8 is the only character encoding mandatory to implement in all IETF track protocols.
http://www.ietf.org/rfc/rfc2277.txt
You might be interested in this question. I've been trying to build a CW about the support for unicode in various languages.
I'm interested both in statistical
data and the situation in specific
countries.
On W3Techs, we have all these data, but it's perhaps not easy to find:
For example, you get the character encoding distribution of Japanese websites by first selecting the language: Content Languages > Japanese, and then you select Segmentation > Character Encodings. That brings you to this report: Distribution of character encodings among websites that use Japanese. You see: Japanese sites use 49% SHIFT-JIS and 38% UTF-8. You can do the same per top level domain, say all .jp sites.
Both Java and C# use UTF-16 internally and can easily translate to other encodings; they're pretty well entrenched in the enterprise world.
I'd say accepting only UTF as input is not that big a deal these days; go for it.
I'm interested both in statistical
data and the situation in specific
countries.
I think this is much more dependent on the problem domain and its history, then on the country in which an application is used.
If you're building an application for which all your competitors are outputting in e.g. ISO-8859-1 (or have been for the majority of the last 10 years), I think all your (potential) clients would expect you to open such files without much hassle.
That said, I don't think most of the time there's still a need to output anything but UTF-8 encoded files. Most programs cope these days, but once again, YMMV depending on your target market.

Detecting a (naughty or nice) URL or link in a text string

How can I detect (with regular expressions or heuristics) a web site link in a string of text such as a comment?
The purpose is to prevent spam. HTML is stripped so I need to detect invitations to copy-and-paste. It should not be economical for a spammer to post links because most users could not successfully get to the page. I would like suggestions, references, or discussion on best-practices.
Some objectives:
The low-hanging fruit like well-formed URLs (http://some-fqdn/some/valid/path.ext)
URLs but without the http:// prefix (i.e. a valid FQDN + valid HTTP path)
Any other funny business
Of course, I am blocking spam, but the same process could be used to auto-link text.
Ideas
Here are some things I'm thinking.
The content is native-language prose so I can be trigger-happy in detection
Should I strip out all whitespace first, to catch "www .example.com"? Would common users know to remove the space themselves, or do any browsers "do-what-I-mean" and strip it for you?
Maybe multiple passes is a better strategy, with scans for:
Well-formed URLs
All non-whitespace followed by '.' followed by any valid TLD
Anything else?
Related Questions
I've read these and they are now documented here, so you can just references the regexes in those questions if you want.
replace URL with HTML Links javascript
What is the best regular expression to check if a string is a valid URL
Getting parts of a URL (Regex)
Update and Summary
Wow, I there are some very good heuristics listed in here! For me, the best bang-for-the-buck is a synthesis of the following:
#Jon Bright's technique of detecting TLDs (a good defensive chokepoint)
For those suspicious strings, replace the dot with a dot-looking character as per #capar
A good dot-looking character is #Sharkey's subscripted · (i.e. "·"). · is also a word boundary so it's harder to casually copy & paste.
That should make a spammer's CPM low enough for my needs; the "flag as inappropriate" user feedback should catch anything else. Other solutions listed are also very useful:
Strip out all dotted-quads (#Sharkey's comment to his own answer)
#Sporkmonger's requirement for client-side Javascript which inserts a required hidden field into the form.
Pinging the URL server-side to establish whether it is a web site. (Perhaps I could run the HTML through SpamAssassin or another Bayesian filter as per #Nathan..)
Looking at Chrome's source for its smart address bar to see what clever tricks Google uses
Calling out to OWASP AntiSAMY or other web services for spam/malware detection.
I'm concentrating my answer on trying to avoid spammers. This leads to two sub-assumptions: the people using the system will therefore be actively trying to contravene your check and your goal is only to detect the presence of a URL, not to extract the complete URL. This solution would look different if your goal is something else.
I think your best bet is going to be with the TLD. There are the two-letter ccTLDs and the (currently) comparitively small list of others. These need to be prefixed by a dot and suffixed by either a slash or some word boundary. As others have noted, this isn't going to be perfect. There's no way to get "buyfunkypharmaceuticals . it" without disallowing the legitimate "I tried again. it doesn't work" or similar. All of that said, this would be my suggestion:
[^\b]\.([a-zA-Z]{2}|aero|asia|biz|cat|com|coop|edu|gov|info|int|jobs|mil|mobi|museum|name|net|org|pro|tel|travel)[\b/]
Things this will get:
buyfunkypharmaceuticals.it
google.com
http://stackoverflo**w.com/**questions/700163/
It will of course break as soon as people start obfuscating their URLs, replacing "." with " dot ". But, again assuming spammers are your goal here, if they start doing that sort of thing, their click-through rates are going to drop another couple of orders of magnitude toward zero. The set of people informed enough to deobfuscate a URL and the set of people uninformed enough to visit spam sites have, I think, a miniscule intersection. This solution should let you detect all URLs that are copy-and-pasteable to the address bar, whilst keeping collateral damage to a bare minimum.
I'm not sure if detecting URLs with a regex is the right way to solve this problem. Usually you will miss some sort of obscure edge case that spammers will be able to exploit if they are motivated enough.
If your goal is just to filter spam out of comments then you might want to think about Bayesian filtering. It has proved to be very accurate in flagging email as spam, it might be able to do the same for you as well, depending on the volume of text you need to filter.
I know this doesn't help with auto-link text but what if you search and replaced all full-stop periods with a character that looks like the same thing, such as the unicode character for hebrew point hiriq (U+05B4)?
The following paragraph is an example:
This might workִ The period looks a bit odd but it is still readableִ The benefit of course is that anyone copying and pasting wwwִgoogleִcom won't get too farִ :)
Well, obviously the low hanging fruit are things that start with http:// and www. Trying to filter out things like "www . g mail . com" leads to interesting philosophical questions about how far you want to go. Do you want to take it the next step and filter out "www dot gee mail dot com" also? How about abstract descriptions of a URL, like "The abbreviation for world wide web followed by a dot, followed by the letter g, followed by the word mail followed by a dot, concluded with the TLD abbreviation for commercial".
It's important to draw the line of what sorts of things you're going to try to filter before you continue with trying to design your algorithm. I think that the line should be drawn at the level where "gmail.com" is considered a url, but "gmail. com" is not. Otherwise, you're likely to get false positives every time someone fails to capitalize the first letter in a sentence.
Since you are primarily looking for invitations to copy and paste into a browser address bar, it might be worth taking a look at the code used in open source browsers (such as Chrome or Mozilla) to decide if the text entered into the "address bar equivalent" is a search query or a URL navigation attempt.
Ping the possible URL
If you don't mind a little server side computation, what about something like this?
urls = []
for possible_url in extracted_urls(comment):
if pingable(possible_url):
urls.append(url) #you could do this as a list comprehension, but OP may not know python
Here:
extracted_urls takes in a comment and uses a conservative regex to pull out possible candidates
pingable actually uses a system call to determine whether the hostname exists on the web. You could have a simple wrapper parse the output of ping.
[ramanujan:~/base]$ping -c 1 www.google.com
PING www.l.google.com (74.125.19.147): 56 data bytes
64 bytes from 74.125.19.147: icmp_seq=0 ttl=246 time=18.317 ms
--- www.l.google.com ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max/stddev = 18.317/18.317/18.317/0.000 ms
[ramanujan:~/base]$ping -c 1 fooalksdflajkd.com
ping: cannot resolve fooalksdflajkd.com: Unknown host
The downside is that if the host gives a 404, you won't detect it, but this is a pretty good first cut -- the ultimate way to verify that an address is a website is to try to navigate to it. You could also try wget'ing that URL, but that's more heavyweight.
Having made several attempts at writing this exact piece of code, I can say unequivocally, you won't be able to do this with absolute reliability, and you certainly won't be able to detect all of the URI forms allowed by the RFC. Fortunately, since you have a very limited set of URLs you're interested in, you can use any of the techniques above.
However, the other thing I can say with a great deal of certainty, is that if you really want to beat spammers, the best way to do that is to use JavaScript. Send a chunk of JavaScript that performs some calculation, and repeat the calculation on the server side. The JavaScript should copy the result of the calculation to a hidden field so that when the comment is submitted, the result of the calculation is submitted as well. Verify on the server side that the calculation is correct. The only way around this technique is for spammers to manually enter comments or for them to start running a JavaScript engine just for you. I used this technique to reduce the spam on my site from 100+/day to one or two per year. Now the only spam I ever get is entered by humans manually. It's weird to get on-topic spam.
Of course you realize if spammers decide to use tinuyrl or such services to shorten their URLs you're problem just got worse. You might have to write some code to look up the actual URLs in that case, using a service like TinyURL decoder
Consider incorporating the OWASP AntiSAMY API...
I like capar's answer the best so far, but dealing with unicode fonts can be a bit fraught, with older browsers often displaying a funny thing or a little box ... and the location of the U+05B4 is a bit odd ... for me, it appears outside the pipes here |ִ| even though it's between them.
There's a handy · (·) though, which breaks cut and paste in the same way. Its vertical alignment can be corrected by <sub>ing it, eg:
stackoverflow·com
Perverse, but effective in FF3 anyway, it can't be cut-and-pasted as a URL. The <sub> is actually quite nice as it makes it visually obvious why the URL can't be pasted.
Dots which aren't in suspected URLs can be left alone, so for example you could do
s/\b\.\b/<sub>·<\/sub>/g
Another option is to insert some kind of zero-width entity next to suspect dots, but things like ‍ and ‌ and &ampzwsp; don't seem to work in FF3.
There's already some great answers in here, so I won't post more. I will give a couple of gotchas though. First, make sure to test for known protocols, anything else may be naughty. As someone whose hobby concerns telnet links, you will probably want to include more than http(s) in your search, but may want to prevent say aim: or some other urls. Second, is that many people will delimit their links in angle-brackets (gt/lt) like <http://theroughnecks.net> or in parens "(url)" and there's nothing worse than clicking a link, and having the closing > or ) go allong with the rest of the url.
P.S. sorry for the self-referencing plugs ;)
I needed just the detection of simple http urls with/out protocol, assuming that either the protocol is given or a 'www' prefix. I found the above mentioned link quite helpful, but in the end I came out with this:
http(s?)://(\S+\.)+\S+|www\d?\.(\S+\.)+\S+
This does, obviously, not test compliance to the dns standard.
Given the messes of "other funny business" that I see in Disqus comment spam in the form of look-alike characters, the first thing you'll want to do is deal with that.
Luckily, the Unicode people have you covered. Dig up an implementation of the TR39 Skeleton Algorithm for Unicode Confusables in your programming language of choice and pair it with some Unicode normalization and Unicode-aware upper/lower-casing.
The skeleton algorithm uses a lookup table maintained by the Unicode people to do something conceptually similar to case-folding.
(The output may not use sensible characters, but, if you apply it to both sides of the comparison, you'll get a match if the characters are visually similar enough for a human to get the intent.)
Here's an example from this Java implementation:
// Skeleton representations of unicode strings containing
// confusable characters are equal
skeleton("paypal").equals(skeleton("paypal")); // true
skeleton("paypal").equals(skeleton("𝔭𝒶ỿ𝕡𝕒ℓ")); // true
skeleton("paypal").equals(skeleton("ρ⍺у𝓅𝒂ן")); // true
skeleton("ρ⍺у𝓅𝒂ן").equals(skeleton("𝔭𝒶ỿ𝕡𝕒ℓ")); // true
skeleton("ρ⍺у𝓅𝒂ן").equals(skeleton("𝔭𝒶ỿ𝕡𝕒ℓ")); // true
// The skeleton representation does not transform case
skeleton("payPal").equals(skeleton("paypal")); // false
// The skeleton representation does not remove diacritics
skeleton("paypal").equals(skeleton("pàỳpąl")); // false
(As you can see, you'll want to do some other normalization first.)
Given that you're doing URL detection for the purpose of judging whether something's spam, this is probably one of those uncommon situations where it'd be safe to start by normalizing the Unicode to NFKD and then stripping codepoints declared to be combining characters.
(You'd then want to normalize the case before feeding them to the skeleton algorithm.)
I'd advise that you do one of the following:
Write your code to run a confusables check both before and after the characters get decomposed, in case things are considered confusables before being decomposed but not after, and check both uppercased and lowercased strings in case the confusables tables aren't symmetrical between the upper and lowercase forms.
Investigate whether #1 is actually a concern (no need to waste CPU time if it isn't) by writing a little script to inspect the Unicode tables and identify any codepoints where decomposing or lowercasing/uppercasing a pair of characters changes whether they're considered confusable with each other.

Are you fluent in Unicode yet?

Almost 5 years ago Joel Spolsky wrote this article, "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".
Like many, I read it carefully, realizing it was high-time I got to grips with this "replacement for ASCII". Unfortunately, 5 years later I feel I have slipped back into a few bad habits in this area. Have you?
I don't write many specifically international applications, however I have helped build many ASP.NET internet facing websites, so I guess that's not an excuse.
So for my benefit (and I believe many others) can I get some input from people on the following:
How to "get over" ASCII once and for all
Fundamental guidance when working with Unicode.
Recommended (recent) books and websites on Unicode (for developers).
Current state of Unicode (5 years after Joels' article)
Future directions.
I must admit I have a .NET background and so would also be happy for information on Unicode in the .NET framework. Of course this shouldn't stop anyone with a differing background from commenting though.
Update: See this related question also asked on StackOverflow previously.
Since I read the Joel article and some other I18n articles I always kept a close eye to my character encoding; And it actually works if you do it consistantly. If you work in a company where it is standard to use UTF-8 and everybody knows this / does this it will work.
Here some interesting articles (besides Joel's article) on the subject:
http://www.tbray.org/ongoing/When/200x/2003/04/06/Unicode
http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF
A quote from the first article; Tips for using Unicode:
Embrace Unicode, don't fight it; it's probably the right thing to do, and if it weren't you'd probably have to anyhow.
Inside your software, store text as UTF-8 or UTF-16; that is to say, pick one of the two and stick with it.
Interchange data with the outside world using XML whenever possible; this makes a whole bunch of potential problems go away.
Try to make your application browser-based rather than write your own client; the browsers are getting really quite good at dealing with the texts of the world.
If you're using someone else's library code (and of course you are), assume its Unicode handling is broken until proved to be correct.
If you're doing search, try to hand the linguistic and character-handling problems off to someone who understands them.
Go off to Amazon or somewhere and buy the latest revision of the printed Unicode standard; it contains pretty well everything you need to know.
Spend some time poking around the Unicode web site and learning how the code charts work.
If you're going to have to do any serious work with Asian languages, go buy the O'Reilly book on the subject by Ken Lunde.
If you have a Macintosh, run out and grab Lord Pixel's Unicode Font Inspection tool. Totally cool.
If you're really going to have to get down and dirty with the data, go attend one of the twice-a-year Unicode conferences. All the experts go and if you don't know what you need to know, you'll be able to find someone there who knows.
I spent a while working with search engine software - You wouldn't believe how many web sites serve up content with HTTP headers or meta tags which lie about the encoding of the pages. Often, you'll even get a document which contains both ISO-8859 characters and UTF-8 characters.
Once you've battled through a few of those sorts of issues, you start taking the proper character encoding of data you produce really seriously.
The .NET Framework uses Windows default encoding for storing strings, which turns out to be UTF-16. If you don't specify an encoding when you use most text I/O classes, you will write UTF-8 with no BOM and read by first checking for a BOM then assuming UTF-8 (I know for sure StreamReader and StreamWriter behave this way.) This is pretty safe for "dumb" text editors that won't understand a BOM but kind of cruddy for smarter ones that could display UTF-8 or the situation where you're actually writing characters outside the standard ASCII range.
Normally this is invisible, but it can rear its head in interesting ways. Yesterday I was working with someone who was using XML serialization to serialize an object to a string using a StringWriter, and he couldn't figure out why the encoding was always UTF-16. Since a string in memory is going to be UTF-16 and that is enforced by .NET, that's the only thing the XML serialization framework could do.
So, when I'm writing something that isn't just a throwaway tool, I specify a UTF-8 encoding with a BOM. Technically in .NET you will always be accidentally Unicode aware, but only if your user knows to detect your encoding as UTF-8.
It makes me cry a little every time I see someone ask, "How do I get the bytes of a string?" and the suggested solution uses Encoding.ASCII.GetBytes() :(
Rule of thumb: if you never munge or look inside a string and instead treat it strictly as a blob of data, you'll be much better off.
Even doing something as simple as splitting words or lowercasing strings becomes tough if you want to do it "the Unicode way".
And if you want to do it "the Unicode way", you'll need an awfully good library. This stuff is incredibly complex.