Does MySQL server implicitly support encoding conversion? - mysql

I need to implement a sorted SELECT, on a specific encoding of a field, without CONVERT.
That is, normally I'd do it by
SELECT * FROM table ORDER BY CONVERT(field USING gbk) COLLATE gbk_chinese_ci
However for some reason CONVERT was not allowed. As a result, I tried to approach this by
ALTER TABLE table MODIFY field VARCHAR(xx) CHARACTER SET gbk COLLATE gbk_chinese_ci;
SELECT * FROM table ORDER BY field
It works. That's good. However I'm worried about encoding problems.
The connection to the MySQL server includes the parameters characterEncoding=utf8 and useUnicode=true. I couldn't find an explanation of these parameters in MySQL's official documentation yet, but I suppose they ensure that communication between the client and the server is in utf-8.
That brings up the question: does the MySQL server implicitly convert data from utf-8 to gbk when it receives the data? Do the connection parameters only define the charset of the communication rather than that of the final stored data?
Edit
Comments say that the server does convert them! Thanks guys!
My further confusion is that, only one of the fields is set to use gbk, while everything else has been left to use utf8. That means the server's charset should still be utf8 globally but gbk locally for that field only.
Suppose now I fire this line of script to the server
INSERT INTO table (field_gbk, field_utf8) VALUES ("a", "b");
Does the server:
Receive the whole statement in utf8;
Convert only "a" to gbk and store it; and
Store "b" as-is in the database?
Many thanks guys!

Yes.
You specify the client's encoding when you connect.
You specify the encoding ("character set") of the column you are inserting into.
MySQL converts from one encoding to the other as it INSERTs the rows. Similarly, it converts the other way when SELECTing.
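To make that conversion concrete, here is a minimal Python sketch (not MySQL itself) that mimics what the server does for a gbk column on a utf8 connection. The sample strings and the helper name client_to_column are made up for illustration, and sorting by raw GBK bytes is only a rough stand-in for gbk_chinese_ci, which applies its own weights.

```python
# Sketch only: Python's codecs stand in for the server's conversion.
samples = ["中", "啊", "波"]  # hypothetical values for a gbk column


def client_to_column(utf8_bytes: bytes, column_charset: str) -> bytes:
    """Decode what the client sent (UTF-8) and re-encode it for the column."""
    return utf8_bytes.decode("utf-8").encode(column_charset)


wire = "中".encode("utf-8")             # bytes sent over a utf8 connection
stored = client_to_column(wire, "gbk")  # bytes a gbk column would hold
print(wire.hex(), stored.hex())         # e4b8ad d6d0

# Byte order differs between the two encodings, which is one reason the
# column's character set and collation affect ORDER BY.
by_utf8 = sorted(samples, key=lambda s: s.encode("utf-8"))
by_gbk = sorted(samples, key=lambda s: s.encode("gbk"))
print(by_utf8)  # ['中', '啊', '波']
print(by_gbk)   # ['啊', '波', '中']
```

The same characters come back out on SELECT because the reverse conversion is applied; only the stored bytes and the sort order differ.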
The CONVERT function should not (normally) be used for anything.
You are using Java? characterEncoding=utf8 and useUnicode=true are what the JDBC driver uses to declare the client-side encoding.
"gbk" for a single column? Fine. That column will be handled differently than the other columns.

Related

How to find out mysql field level charset?

I need to convert latin1 charset of a table to utf8.
Quoting from mysql docs:
The CONVERT TO operation converts column values between the original and named character sets. This is not what you want if you have a column in one character set (like latin1) but the stored values actually use some other, incompatible character set (like utf8mb4). In this case, you have to do the following for each such column:
ALTER TABLE t1 CHANGE c1 c1 BLOB;
ALTER TABLE t1 CHANGE c1 c1 TEXT CHARACTER SET utf8mb4;
This answer shows how to find out charset at DB level, table level, and column level. But I need to find out the charset of the actual stored values. How can I do that?
Since my connector/j jdbc connection string doesn't specify any characterEncoding or connectionCollation properties, it is possible that it used utf8 by default to store the values, in which case I don't need any conversion, just change the table metadata.
mysql-connector-java version: 8.0.22
mysql database version: 5.6
spring boot version: 2.5.x
The character set of the string in a given column should be the same as the column definition.
There have been cases where people accidentally store the bytes of the wrong encoding in a column. For example, they store bytes of a latin1 encoding in a utf8 field. This is a terrible idea, because queries can't tell the difference. Those bytes may not be valid values of the column's defined encoding, and this results in garbage data. Cleaning up a table where some of the strings are stored in the wrong encoding is an unpleasant chore.
So I strongly urge you to store only strings encoded in a compatible way according to the column's definition, and to assume that all strings are stored this way.
To answer the title:
SHOW CREATE TABLE tablename shows the default charset for the table and any overrides for individual columns.
Don't blindly use CONVERT TO, and especially not the 2-step ALTER you are showing. First, let's see what is in the table now (SELECT col, HEX(col) ... for something with accented text).
See Trouble with UTF-8 characters; what I see is not what I stored for the main 4 types of problems.
This gives several cases and how to fix them. http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases
One case involves using CONVERT TO; two other cases involve using BLOB or VARBINARY.
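The HEX() inspection suggested in this answer can be rehearsed offline. The following Python sketch is only an illustration: the helper name classify_hex is hypothetical, and Python's cp1252 codec is used as an approximation of MySQL's latin1. It compares the hex you see in the table against what a known string would look like under each scenario.

```python
def classify_hex(hex_from_mysql: str, expected_text: str) -> str:
    """Guess how expected_text was stored, given the HEX(col) output."""
    observed = bytes.fromhex(hex_from_mysql)
    utf8 = expected_text.encode("utf-8")
    candidates = {
        "latin1 bytes": expected_text.encode("cp1252", errors="replace"),
        "utf8 bytes": utf8,
        # UTF-8 bytes misread as latin1 and then converted to UTF-8 again
        "double-encoded utf8": utf8.decode("cp1252", errors="replace").encode("utf-8"),
    }
    for label, candidate in candidates.items():
        if observed == candidate:
            return label
    return "unknown"


print(classify_hex("E9", "é"))        # latin1 bytes
print(classify_hex("C3A9", "é"))      # utf8 bytes
print(classify_hex("C383C2A9", "é"))  # double-encoded utf8
```

Pick a value that contains an accented character: for pure ASCII text all three candidates are identical, so the check tells you nothing.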

Databases: column encoding, when is it important?

We are importing data from .sql script containing UTF-8 encoded data to MySQL database:
mysql ... database_name < script.sql
Later this data is displayed on a page in our web application (connected to that database), again in UTF-8. But somewhere in the process something went wrong, because non-ASCII characters were displayed incorrectly.
Our first attempt to solve it was to change mysql columns encoding to UTF-8 (as described for example here):
alter table wp_posts change post_content post_content LONGBLOB;
alter table wp_posts change post_content post_content LONGTEXT CHARACTER SET utf8;
But it didn't help.
Finally we solved the problem by importing the data from the .sql script with an additional command-line flag which, I believe, forces the mysql client to treat the data in the script as UTF-8.
mysql ... --default-character-set=utf8 database_name < script.sql
It helped but then we realized that this time we forgot to change column encoding to utf8 - it was set to latin1 even if utf-8 encoded data was flowing through database (from sql script to application).
So if data obtained from database is displayed correctly even if database character set is set incorrectly, then why the heck should I bother setting correct database encoding?
Especially I would like to know:
What parts of database rely on column encoding setting? When this setting has any real meaning?
On what occasions implicit conversion of column encoding is done?
How does trick with converting column to binary format and then to the destination encoding work (see: sql code snippet above)? I still don't get it.
Hope someone can help me clear things up...
The biggest reason, in my view, is that it breaks your DB consistency.
it happens way too often that you need to check data in the database. And if you cannot properly input UTF-8 strings coming from the web page into your MySQL CLI client, it's a pity;
if you need to use phpMyAdmin to administer your database through the “correct” web, then you're limiting yourself (might not be an issue though);
if you need to build a report on your data, then you're greatly limited in your choice of tools, given that only the web produces the correct output;
if you need to deliver a partial database extract to your partner or an external company for analysis, and the extract is messed up, it's a pity.
Now to your questions:
When you ask the database to ORDER BY a column of a string data type, the sorting rules take the column's encoding into account, and some internal transformations apply if different columns have different encodings. The same applies when you compare strings: encoding information is essential there. Encoding comes together with collation, although most people rarely use that feature.
As mentioned, if you have columns in different encodings, the database will implicitly convert values to a common encoding, typically UTF-8 nowadays. Implicit recoding of strings may also be done in client frameworks or libraries, depending on the encoding of the client's environment. Typically, data is recoded into the database's encoding when sent to the server and back into the client's encoding when results are delivered.
Binary data has no notion of encoding, it's just a set of bytes. So when you convert to binary, you're telling database to “forget” encoding, although you keep data without changes. Later, you convert to the string enforcing the right encoding. This trick helps if you're sure that data physically is in UTF-8, while by some accident a different encoding was specified.
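The difference between relabelling bytes (the binary trick) and transcoding them (what a converting ALTER does) can be sketched in Python. This is an illustration, not MySQL's implementation: the variable names are made up, and cp1252 approximates MySQL's latin1.

```python
# Bytes physically in the column: UTF-8 for "é", but the column says latin1.
stored = "é".encode("utf-8")

# What a latin1 column makes of those bytes when they are read.
seen_as_latin1 = stored.decode("cp1252")

# Route 1: text -> binary -> text with the right charset.
# The bytes are untouched; only the label changes.
relabelled = stored
fixed_text = relabelled.decode("utf-8")

# Route 2: a converting ALTER trusts the latin1 label and rewrites the
# bytes, which bakes the misreading into the data.
transcoded = stored.decode("cp1252").encode("utf-8")
broken_text = transcoded.decode("utf-8")

print(seen_as_latin1)    # Ã©
print(fixed_text)        # é
print(broken_text)       # Ã©
print(transcoded.hex())  # c383c2a9
```

Route 1 is only safe when every value in the column really is UTF-8; if the column holds a mix, relabelling corrupts the values that were correct.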
Given that you managed to load the data correctly by using --default-character-set=utf8, something in your environment was off; I suspect it was not set up for UTF-8.
I think the best practice today would be to:
have all your environments being UTF8 ready, including shells;
have all your databases defaulting to UTF8 encoding.
This way you'll have less room for errors.

What should be the correct MySql collation store in this case?

I'm storing strings on a Mysql database.
Some of the strings have single quotes which then get stored like this:
People’s
Is this the proper way to store these strings or should I set a different mysql collation?
I have tried the following without luck....
utf8_general_ci
latin1_swedish_ci
Where are you setting the collation? You should be using UTF-8 in three places:
as the collation of each column that contains character data. You can set the default collation for the table or database so that new columns pick it up, but if you already have a table, ALTERing its default collation doesn't change the collation of the existing columns.
as the encoding of the connection between your application and MySQL. This can be set manually using the SET NAMES statement, or, better, with the suitable API call for your environment (for example mysql_set_charset() in PHP, or the charset argument to connect() in Python MySQLdb).
in your output. For example if producing a web page, by using the Content-Type: text/html;charset=utf-8 header/meta.
You can store the string "People’s" as UTF-8-hidden-in-Latin-1 "People’s" by using Latin-1 throughout, since you'll still get the same bytes out as you put in. But that way you won't get sensible results from ordering or case-insensitive comparisons of non-ASCII characters.

Unicode Comparing in PHP/MySQL

The name Accîdent seems to be different than AccÎdent when I do a database query to update the column. Yet Accîdent and AccÎdent point to the same place...
In MySQL Accîdent = Accîdent when inserted.
Also, AccÎdent = AccÃŽdent.
Do you know why this is?
By default, MySQL assumes the client uses the latin1 character set (this is the server default in MySQL versions before 8.0). If you're using UTF-8 in your PHP scripts, then this assumption is false. You need to tell MySQL that you're using UTF-8 by issuing this SQL statement right after the database connection is opened:
SET NAMES utf8
Then the data inserted by the following SQL statements will use the correct character set. This means that you need to re-insert your data or follow the MySQL conversion procedure (see the last paragraphs).
It is recommended that your tables are configured to store data in UTF-8, too, to avoid unnecessary read/write character set conversions. That's not required, though.
More information is available in the MySQL documentation. Specifically, Connection Character Sets and Collations.
First, you seem to be storing UTF-8 data in a table of different encoding. MySQL will try and cope, but the side effect is as you see - data in the database will look "weird". When creating a table, you need to specify the character encoding - preferably UTF-8. For existing tables, you'll need to convert the data.
Second, tables have a "collation" besides the encoding. The encoding determines how characters map to bytes; the collation determines sorting and comparison. There are language-specific collations, but utf8_general_ci should be the one you're looking for (ci stands for "case insensitive"); with it, your two strings would match.

Mysql turns ' into ’?

How can I stop mysql from converting ' into ’ when I do an insert?
I believe it has something to do with the charset?
I am using php to do the mysql_insert.
The quotation mark you posted is a typographic (curly) right single quotation mark, U+2019, which some web applications and word processors substitute for the plain apostrophe. It's a multi-byte character in UTF-8, and when its bytes are interpreted as Latin-1 it shows up as '’'. This means that you need to change MySQL's charset to UTF8, or alternatively change your website's charset to Latin-1. The former would be preferred:
ALTER DATABASE YourDatabase CHARACTER SET utf8;
ALTER TABLE YourTableOne CONVERT TO CHARACTER SET utf8;
ALTER TABLE YourTableTwo CONVERT TO CHARACTER SET utf8;
...
ALTER TABLE YourTableN CONVERT TO CHARACTER SET utf8;
Maybe someone will know the answer immediately, but I don't. However here are a few suggestions on what to examine (and possibly expand the question on)
When dealing with encodings and escaping you should include the full history of data
how was it created
what happened to it before the problem (did it have to go through backup, e-mail, was it created on a different server, OS, etc..; if it was transferred then was it as text file?)
The above is because anything that writes to a text file (browser, mysql client, web server, php application, to name a few layers that could have done it) can mess up character coding.
To troubleshoot, you can start eliminating, and thus the first step (in my book), is to
connect to mysql server using mysql command line client.
check the output of SHOW VARIABLES LIKE 'character_set%'
(so even in this simple environment you have 7 values that can influence how the data is parsed, stored and/or displayed)
inspect SHOW CREATE TABLE TableName, and look for charset and collation info, both default for the table and explicit definition on columns
Having said all of the above, I don't think any western script would transcode a single quote character. So you might need to look at your escaping and other data processing.
EDIT
Most of the above from answer and discussion here
This is what I've done, and it worked for me:
First make sure that column containing ' is utf8_general_ci
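Then add a mysql_set_charset() call to your code (the mysql_* extension was removed in PHP 7; with mysqli the equivalent call is mysqli_set_charset()):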
$db=mysql_connect("localhost", $your_username, $your_password);
mysql_set_charset('utf8',$db);
mysql_select_db($your_db_name, $db);