How efficient is MySQL full-text search for non-English languages?
I am starting a project and had chosen Postgres due to its great support for full-text search in multiple languages. In Postgres I can specify the language in the full-text search to get the best out of it.
Does MySQL have something similar? Is anyone using MySQL full-text search in non-English languages?
It does NOT work well with Chinese or Thai as of version 8.
The built-in MySQL full-text parser uses the whitespace between words as the delimiter. Languages such as Chinese, Japanese, Korean, Thai, and Khmer use writing systems that don't commonly put whitespace between individual words.
MySQL has a MeCab parser to address the problem, but I haven't tried it.
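As a minimal sketch that goes beyond the original answer: MySQL 5.7 and later also ship a built-in ngram full-text parser aimed at Chinese, Japanese, and Korean text, while the MeCab plugin mentioned above does morphological tokenization for Japanese. The table, column, and connection details below are placeholders.

import mysql.connector  # assumes the mysql-connector-python package

conn = mysql.connector.connect(host="localhost", user="app", password="secret",
                               database="demo", charset="utf8mb4")
cur = conn.cursor()

# The FULLTEXT index is declared WITH PARSER ngram so CJK text is tokenized
# into n-grams instead of being split on (mostly absent) whitespace.
cur.execute("""
    CREATE TABLE IF NOT EXISTS articles (
        id INT AUTO_INCREMENT PRIMARY KEY,
        body TEXT,
        FULLTEXT KEY ft_body (body) WITH PARSER ngram
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
""")

# Natural-language full-text search against Chinese text.
cur.execute("SELECT id FROM articles "
            "WHERE MATCH(body) AGAINST(%s IN NATURAL LANGUAGE MODE)",
            ("数据库",))
print(cur.fetchall())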
When building websites for non-English speaking countries,
you have tons of characters that fall outside the basic ASCII set.
For the database I usually encode it as either UTF-8 or Latin-1.
I would like to know if there is any issue with performance, resolution speed, space optimization, etc.
For the fixed texts that are in the HTML, which is better to use, for example:
&aacute; or á
which look exactly the same when rendered: á or á
The points I have gathered so far about using the literal UTF-8 character:
Pros:
Easy to read for the developers and the web administrator
Only one character occupied in the code instead of 4-5
Easier to extract an excerpt from a text
1 byte against 8 bytes (according to my testing; see the quick check below)
Cons:
When sending files to other developers, depending on the IDE, software, etc. they use to read the code, the accents can break into things like: é
When automatic minification of the code occurs, it sometimes breaks them too
They usually break when the text goes through another encoding step
The two cons have a bigger weight than the pros from my perspective, because they affect what the visitor sees.
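To make the size comparison concrete, here is the quick check I mean (a small sketch; the entity &aacute; is used as the example):

# Compare the stored size of the literal character versus the HTML entity.
literal = "á"
entity = "&aacute;"

print(len(literal.encode("latin-1")))  # 1 byte in Latin-1
print(len(literal.encode("utf-8")))    # 2 bytes in UTF-8
print(len(entity.encode("utf-8")))     # 8 bytes either way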
Just use the actual character á.
This is for many reasons.
First: separation of concerns; the database shouldn't know about HTML. Just imagine if at a later date you want to create an API to use the data in another service or a mobile app.
Second: just use UTF-8 for your database, not Latin-1. Again, think ahead: what if your app suddenly needs to support Japanese? How would you store あ?
You always have the chance to convert it to HTML entities if you really have to... in a view. HTML is an implementation detail, not core to your app.
If your concern is the user, all major browsers in this day and age support UTF-8. Just use the right meta tag. Easy.
If your problem is developers and their tools, take a look at http://editorconfig.org/ to enforce and automate line endings and the use of UTF-8 in your files.
Maybe add some git attributes to the mix, and why not go the extra mile and have a git pre-commit hook running a checker to make super sure everyone commits UTF-8 files.
Computer time is cheap, developer time is expensive: á is easier to change and understand, just use it.
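If some legacy output channel really does need entities, a minimal sketch of doing that conversion at the view layer (the helper name render_ascii_safe is invented for illustration):

def render_ascii_safe(text: str) -> str:
    # Keep the raw UTF-8 text in the database; only entity-encode at the edge.
    # xmlcharrefreplace turns every non-ASCII character into a numeric
    # HTML character reference, e.g. "á" becomes "&#225;".
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

print(render_ascii_safe("ação em português"))  # a&#231;&#227;o em portugu&#234;s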
I have a website that has multiple translations. Everything is working fine for Chinese, Japanese, and other languages. For some reason, when we add some Portuguese characters they get replaced with question marks.
Any way to prevent that?
This means you are using different encodings between your site and the database. It is recommended to change your encoding to UTF-8 in the HTTP headers, the HTML meta charset tag, and the database.
This is a good article about this topic.
Handling Unicode Front to Back in a Web App
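A sketch of what "UTF-8 end to end" can look like in practice (the connection details and the table name translations are placeholders):

import mysql.connector

# 1. Negotiate the database connection as utf8mb4, MySQL's full UTF-8.
conn = mysql.connector.connect(host="localhost", user="app", password="secret",
                               database="site", charset="utf8mb4",
                               collation="utf8mb4_unicode_ci")
cur = conn.cursor()

# 2. If an existing table was created with the wrong charset, convert it
#    (back up first; this rewrites the table).
cur.execute("ALTER TABLE translations CONVERT TO CHARACTER SET utf8mb4 "
            "COLLATE utf8mb4_unicode_ci")

# 3. Declare the same encoding to the browser in the response header and markup.
http_header = ("Content-Type", "text/html; charset=utf-8")
meta_tag = '<meta charset="utf-8">'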
I have a request from a customer to develop a website in English, Greek, and Chinese. While I know for sure that utf8_general_ci will do for Greek and English, I am not sure if it will work for Chinese.
So the question is: can I use the utf8_general_ci encoding for Chinese, or do I have to make a separate set of tables with a different encoding?
Regards, Zoran
UTF-8 supports practically every language, or more correctly, it supports practically every script. It will work for English, Greek, and Chinese. You might need to convert the encoding at some points, since some tools and data sources still use legacy encodings for East Asian languages, but the database will be fine as long as everything it gets is in UTF-8.
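One note beyond the answer above: in MySQL, utf8_general_ci is a collation of the legacy 3-byte utf8 charset; utf8mb4 with a matching collation is the safer choice for full Unicode coverage. As a quick sanity check that one encoding handles all three languages (sample strings invented):

# English, Greek, and Chinese all round-trip through the same UTF-8 encoding.
for text in ["Hello", "Καλημέρα", "你好"]:
    data = text.encode("utf-8")
    assert data.decode("utf-8") == text
    print(text, "->", len(data), "bytes")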
I am using PB 10.5.2 and EZTwain 3.30.0.28 with XDefs 1.36b1 by Dosadi for scanning.
I am also using TOCR 3.0 for OCR management.
In a function we use the following, among other things:
...
Long ll_acquire
(as_path_filename is a function argument)
...
...
TWAIN_SetAutoOCR(1)
ll_acquire = TWAIN_AcquireMultipageFile(0, as_path_filename)
The problem is that the scanned PDF page has Latin (English) and Greek words.
The English words are found quite precisely when searching, but the Greek ones are not found at all.
Do you think this has to do with the TOCR software?
I just want to be able to search for Greek words as well.
Thanks in advance
The OCR software is most likely where the conversion of the Greek words into searchable text is failing. It looks like you are using EZTwain for the OCR portion, which uses TOCR as its actual OCR engine. You may want to look at the docs for that software and see if they mention any settings that can be modified for multilingual usage.
According to the website TOCR recognizes English, French, Italian, German, Dutch, Swedish, Finnish, Norwegian, Danish, Spanish and Portuguese. You'll need software that can handle mixed Greek and English text. ABBYY FineReader Professional lists support for English and Greek, along with dozens of others.
I want to develop a translation app in a Django project which enables registered users with certain permissions to translate every single message that appears in the latest version.
My question is: what character set should I use for the database tables in this translation app? It looks like some European language characters cannot be stored in UTF-8?
It looks like some European language characters cannot be stored in UTF-8?
Not true. UTF-8 can store any character set without limitations, except maybe Klingon. UTF-8 is your one-stop shop for internationalization. If you have problems with characters, they are most likely encoding problems, or missing support for that character range in the font you're using to display the data (extremely unlikely for a European language character, but common e.g. when viewing Indian sites on a European computer; see also this question).
If a non-Western character set can't be rendered, it could be that the user's installed fonts do not cover that range of Unicode.
Update: Klingon is indeed not part of official Unicode:
Some modern invented scripts which have not yet been included in Unicode (e.g., Tengwar) or which do not qualify for inclusion in Unicode due to lack of real-world use (e.g., Klingon) are listed in the ConScript Unicode Registry, along with unofficial but widely-used Private Use Area code assignments.
However, there is a volunteer project that has unofficially assigned code points U+F8D0 to U+F8FF in the Private Use Area to Klingon. Gallery of Klingon characters
UTF-8 can be used to represent all of Unicode, so it doesn't just let you express all common languages. It allows you to express all languages.
If it seems as if some European characters aren't working, that's an encoding issue somewhere in your stack, not a limitation of UTF-8.
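For the Django case specifically, a minimal sketch of a DATABASES setting (assuming the MySQL backend via mysqlclient; names and credentials are placeholders) that makes the tables use full UTF-8:

# settings.py
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "translations",
        "USER": "app",
        "PASSWORD": "secret",
        "HOST": "localhost",
        # utf8mb4 covers every European script, CJK, and the rest of Unicode.
        "OPTIONS": {"charset": "utf8mb4"},
    }
}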