UTF8 letters issue in latin1 MySQL database

I have a latin1 MySQL database and it is too late to convert it to utf8.
When I search for text that contains an accented letter (a French letter, for example), I get the same result as with the plain English letter.
Example: when I search for "tést", I get "test" from MySQL.
How can I avoid this?
Thank you.

The collation latin1_general_ci treats tést and test as equal, so a search for tést will find test. Since that is probably the collation you have (let's see SHOW CREATE TABLE), there is no efficient way to avoid getting test without changing the collation.
If all you have is Western European characters, utf8 is not a critical goal. Switching to the case- and accent-sensitive collation latin1_bin will keep the two strings distinct. To change the table foo to that collation:
ALTER TABLE foo CONVERT TO CHARACTER SET latin1 COLLATE latin1_bin;
Since it will involve copying the entire table, it will take some time.
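If only a few queries need the strict comparison, MySQL also lets you override the collation per comparison instead of rebuilding the table. A minimal sketch, assuming a latin1 column name in table foo (hypothetical names):
SELECT * FROM foo
WHERE name = 'tést' COLLATE latin1_bin;  -- accent-sensitive lookup, table left unchanged
Be aware that an explicit COLLATE that differs from the column's own collation can keep MySQL from using an index on that column.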

Related

How can I determine utf8 data encoding error and correct it in MySql?

I have a website form written in Perl that saves user input in multiple languages to a MySQL database. While it has worked perfectly, saving and displaying all characters without problems, in phpMyAdmin the characters have always displayed with errors. I ignored this, since the website was displaying characters OK.
Now I've just moved the website to a VPS, and the database has seemingly enforced utf8mb4 encoding on the data, so it is now displaying character errors on the site. I'm not an expert and find the whole encoding area quite confusing. My question is, how can I:
a) determine how my data is actually encoded in my table?
b) convert it correctly to utf8mb4 so it displays correctly in phpMyAdmin and on my website?
All HTML pages use the charset=utf8 declaration. The MySQL connection uses mysql_enable_utf8 => 1. The table in my original database was set to utf8_general_ci collation. The original database collation (I just noticed) was set to latin1_swedish_ci. The new database AND table collation is utf8mb4_general_ci. Thanks in advance.
SHOW CREATE TABLE will tell you the default CHARACTER SET for the table. For any column(s) that overrode the default, the column will specify what it is set to.
However, there could be garbage in the column. Many users have encountered this problem after storing utf8 bytes into a latin1 column. This leads to "Mojibake" or "double encoding".
The only way to tell what is actually stored there is to SELECT HEX(col). A Western European accented character will be:
one byte for a latin1 character stored in a latin1 column;
2 bytes for a utf8 character stored correctly in a utf8 column, or mis-stored as 2 latin1 characters;
several bytes for "double encoding", where the text was converted twice.
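For example, for a hypothetical column col in table tbl containing the letter é, the three cases look like this (standard byte values for é):
SELECT col, HEX(col) FROM tbl LIMIT 10;
-- é stored correctly as latin1:  E9
-- é stored as utf8 bytes:        C3A9
-- é "double encoded":            C383C2A9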
More discussion: Trouble with UTF-8 characters; what I see is not what I stored

Why can my MySQL DB store Arabic characters correctly with latin1 encoding?

Test SELECT:
MySQL [chuangwai]> select ar_detail from items limit 1\G;
*************************** 1. row ***************************
ar_detail: {"طراز": "فساتين قفطان", "المواد": "الشيفون"}
and you can see the Arabic characters displayed correctly.
Then I check encoding:
MySQL [chuangwai]> select * from information_schema.SCHEMATA\G;
*************************** 2. row ***************************
CATALOG_NAME: def
SCHEMA_NAME: chuangwai
DEFAULT_CHARACTER_SET_NAME: latin1
DEFAULT_COLLATION_NAME: latin1_swedish_ci
SQL_PATH: NULL
In another SO post, BalusC said:
If you're trying to store non-Latin characters like Chinese, Japanese,
Hebrew, Cyrillic, etc using Latin1 encoding, then they will end up as
mojibake.
As you see, that is not my case. Could anyone please give me an explanation of why I can store Arabic characters with latin1 encoding? Is it necessary for us to switch the encoding of our DB from latin1 to utf8?
EDIT: Okay, I just found that the encoding of items is utf8...
MySQL [chuangwai]> SELECT TABLE_COLLATION
-> FROM INFORMATION_SCHEMA.TABLES
-> WHERE TABLE_NAME = 'items';
+-----------------+
| TABLE_COLLATION |
+-----------------+
| utf8_unicode_ci |
+-----------------+
The most likely explanation is that your table is utf8, even though your schema default is latin1. Try
SELECT TABLE_COLLATION
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_NAME = 'items';
In my case, a utf8 table gives me utf8_general_ci. You might see utf8mb4_general_ci instead (that's actually better than utf8_general_ci, for a variety of reasons).
Now, as to your question "is it necessary to switch encodings?" The answer is "technically, no, but it would probably be a good idea." As long as you include the encoding in your table definitions, you won't need to worry about the schema encoding. Still, it would be better to switch so that you don't accidentally munge data later.
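If you do decide to switch the schema default, a one-line sketch using the schema name from the question (this changes only the default for future tables; existing tables and data are untouched):
ALTER DATABASE chuangwai CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;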
Please provide SHOW CREATE TABLE. It may be that the table's default is one thing, but the columns are another.
You need to announce to MySQL that the bytes you have in the client are utf8. (They cannot be latin1, much less ascii, since those charsets do not have the characters in question.)
You need the column to be declared CHARACTER SET utf8 (or utf8mb4). Then all will be well.
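A minimal sketch of those two steps, assuming ar_detail is a TEXT column (adjust the type to match your actual table definition):
SET NAMES utf8mb4;  -- announce the client's encoding to MySQL
ALTER TABLE items MODIFY ar_detail TEXT CHARACTER SET utf8mb4;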
But you managed to get somewhere with latin1? Well, that is an accident.
Case 1: You lie about what is in the client and what to store in the table columns. But latin1 is forgiving; it essentially stores bytes without regard for what they mean.
Case 2: You get "double encoding", and the characters end up stored as 4 bytes. But they magically come back out looking OK.
Case 3: Mojibake is another way to do things wrong. But since the text is retrieved intact, I don't think you have this case.
Case... (There are other cases; see the link below.)
In any case, ORDER BY and WHERE are likely to sort or filter things incorrectly.
See "Best Practice" in http://stackoverflow.com/questions/38363566/trouble-with-utf8-characters-what-i-see-is-not-what-i-stored

Relationship between database charset, table charsets and column charsets? Do different charsets lead to any performance issues?

I am developing a website using ASP.NET and my DB is MySQL. Users can submit articles there. The site is international, so I don't want to restrict the language to English only.
So I decided a few things. Please guide me if I made the wrong choice.
1) I chose utf8mb4 as the database charset, because it is an improved version of utf8 that can store further characters. Did I make the right choice? I mean, I have only a few tables that need utf8mb4. So shall I use latin1 as the database charset instead?
2) I don't have an idea which collation to use for the above charset. I decided to use utf8mb4_swedish_ci. Or should I use general_ci or some other collation?
3) Most of my tables don't need the utf8mb4 charset; latin1 with the Swedish collation will do the work. So can I keep selected tables under a specific charset and collation even when the DB is in another charset and collation?
4) Can I use the utf8mb4 charset for a specific column in a table whose charset is latin1 with the Swedish collation?
If those are possible, what is the relationship between the database charset, table charsets and column charsets?
Do different charsets lead to any performance issues?
Thank you very much.
The database charset is inherited by the table, unless you override it. (I recommend being specific at the table level.)
The table charset is inherited by the columns in the table. Since one usually has only one charset, this inheritance is fine. Also, it is pretty clear when you do SHOW CREATE TABLE what each column is set to -- without having to look at the database or system.
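A sketch of the inheritance chain, with hypothetical names d, t, a and b:
CREATE DATABASE d CHARACTER SET utf8mb4;        -- database default
CREATE TABLE d.t (
    a VARCHAR(10),                              -- inherits utf8mb4 from the table
    b VARCHAR(10) CHARACTER SET latin1          -- column-level override
) CHARACTER SET utf8mb4;                        -- stated explicitly at the table level
SHOW CREATE TABLE d.t;                          -- shows what each column ended up with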
Go international -- use utf8 or utf8mb4. I agree that utf8mb4 is a better choice, especially for Chinese and some emoticons.
character_set_% -- Only _client, _connection, and _results are important. And these are the three that are set by SET NAMES utf8mb4. Leave the rest alone.
The default collation for utf8mb4 is utf8mb4_general_ci, which is possibly a good choice if you have multiple languages. The other choice is utf8mb4_unicode_ci . I talk more about "combining diacriticals" in http://mysql.rjweb.org/doc.php/charcoll#combining_diacriticals . This section gives examples of where those two collations differ: http://mysql.rjweb.org/doc.php/charcoll#utf8_collations_examples
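One documented difference between the two collations, assuming the connection is already utf8mb4 (e.g. after SET NAMES utf8mb4), is the German sharp s:
SELECT 'ß' = 'ss' COLLATE utf8mb4_general_ci;  -- 0: general_ci treats ß like 's'
SELECT 'ß' = 'ss' COLLATE utf8mb4_unicode_ci;  -- 1: unicode_ci expands ß to 'ss'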
See also the "Best Practice" section.
latin1 is smaller than utf8 for Western European text. MySQL will do the proper conversions when needed, so that is not a problem. But I prefer not to confuse the programmer by mixing character sets. Keep in mind that converting an existing table column from latin1 to utf8 takes some effort, possible downtime, and maybe risk.
4) Can I use utf8mb4 charset for a specific column in a table which have Latin1 swedesh as charset?
Yes. Each column (but not each row) can have a different character set and/or collation.
The existence of different charsets is not a performance problem, per se. What could bite you is WHERE col1 = col2 (and other cases) when the two columns have a different character set and/or collation. MySQL will abandon an otherwise perfectly good index if it sees a difference that is not easy to handle.
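A hypothetical example of the trap, with t1.name declared latin1 and t2.name declared utf8mb4:
-- The comparison forces a charset conversion, so an index on one of the
-- name columns may be abandoned, turning the join into a scan.
SELECT t2.id
FROM t1 JOIN t2 ON t1.name = t2.name;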

How to handle multilingual MySQL queries?

I have a DB which stores usernames, passwords, and basic info. The 'info', however, only stores English characters.
How can I store data in different languages, say English, French, Russian, Chinese, Japanese, Arabic, etc.? I realized that the default collation doesn't support that.
What is the best solution, and how do you get around it?
Change the default collation of the whole database and also of the table(s) to utf8_general_ci. There is no reason to suffer (with this kind of free-form data).
ALTER DATABASE db CHARACTER SET utf8 COLLATE utf8_general_ci;
ALTER TABLE tbl CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;  -- converts existing column data
ALTER TABLE tbl CHARACTER SET utf8 COLLATE utf8_general_ci;             -- changes only the default for new columns
Read about a few gotchas at the end of this page.

Is this a safe way to convert MySQL tables from latin1 to utf-8?

I need to change all the tables in one of my databases from latin1 to utf-8 (with utf8_bin collation).
I have dumped the database, created a test database from it, and run the following without any errors or warnings for each table:
ALTER TABLE tablename CONVERT TO CHARACTER SET utf8 COLLATE utf8_bin;
Is it safe for me to repeat this on the real database? The data seems fine by inspection...
There are 3 different cases to consider:
The values are indeed encoded using Latin1
This is the consistent case: declared charset and content encoding match. This was the only case I covered in my initial answer.
Use the command you suggested:
ALTER TABLE tablename CONVERT TO CHARACTER SET utf8 COLLATE utf8_bin;
Note that the CONVERT TO CHARACTER SET command only appeared in MySQL 4.1.2, so anyone using a database installed before 2005 had to use an export/import trick. This is why there are so many legacy scripts and documents on the Internet doing it the old way.
The values are already encoded using utf8
In this case, you don't want MySQL to convert any data; you only need to change the column's metadata.
For this, you have to change the type to BLOB first, then to TEXT utf8 for each column, so that there are no value conversions:
ALTER TABLE t1 CHANGE c1 c1 BLOB;
ALTER TABLE t1 CHANGE c1 c1 TEXT CHARACTER SET utf8;
This is the recommended way, and it is explicitly documented in the ALTER TABLE syntax documentation.
The values are in a different encoding
The default encoding was latin1 for several years on some Linux distributions. In this case, you have to use a combination of the two techniques:
Fix the table metadata, using the BLOB-type trick
Convert the values using CONVERT TO.
A straightforward conversion will potentially break any strings with non-ASCII characters.
If you don't have any of those (i.e. all of your text is English), you'll usually be fine.
If you do have any of those, however, you need to convert all CHAR/VARCHAR/TEXT fields to BLOB in an initial run, and convert them to utf8 in a subsequent run.
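A sketch of the combined procedure for a single mislabeled column, reusing the names t1 and c1 from the case above; the remaining, genuinely latin1 columns are then converted in one pass:
ALTER TABLE t1 CHANGE c1 c1 BLOB;                               -- drop the wrong charset label, keep the bytes
ALTER TABLE t1 CHANGE c1 c1 TEXT CHARACTER SET utf8;            -- relabel the bytes as utf8
ALTER TABLE t1 CONVERT TO CHARACTER SET utf8 COLLATE utf8_bin;  -- convert the columns that really are latin1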
See this article for detailed procedures:
http://codex.wordpress.org/Converting_Database_Character_Sets
I've done this a few times on production databases in the past (converting from the old standard encoding, swedish, to latin1), and when MySQL encounters a character that cannot be translated to the target encoding, it aborts the conversion and leaves the table unchanged. Therefore, I'd deem the ALTER TABLE statement safe.