Unicode Comparing in PHP/MySQL

The name Accîdent seems to be different from AccÎdent when I do a database query to update the column. Yet Accîdent and AccÎdent point to the same place...
In MySQL Accîdent = Accîdent when inserted.
Also, AccÎdent = AccÃŽdent.
Do you know why this is?

By default, MySQL assumes the client uses the latin1 character set. If you're using UTF-8 in your PHP scripts, then this assumption is false. You need to specify to MySQL that you're using UTF-8 by issuing this SQL statement just after the database connection is opened:
SET NAMES utf8
Then the data inserted by subsequent SQL statements will use the correct character set. This means that you need to re-insert your data or follow the MySQL conversion procedure (see the last paragraphs).
It is recommended that your tables be configured to store data in UTF-8 too, to avoid unnecessary read/write character set conversions. That's not required, though.
More information is available in the MySQL documentation. Specifically, Connection Character Sets and Collations.
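For example (a minimal sketch; the variable check is just a way to confirm what the server believes about the connection):
SET NAMES utf8;
-- character_set_client, character_set_connection and character_set_results
-- should now all report utf8
SHOW VARIABLES LIKE 'character_set%';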

First, you seem to be storing UTF-8 data in a table of a different encoding. MySQL will try to cope, but the side effect is as you see - data in the database will look "weird". When creating a table, you need to specify the character encoding - preferably UTF-8. For existing tables, you'll need to convert the data.
Second, the tables have a "collation" besides the encoding. Encoding determines how the characters map to bytes; collation determines sorting and comparison. There are language-specific collations, but utf8_general_ci should be the one you're looking for (ci stands for "case insensitive") - then your two strings would match.
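If you need to convert an existing table, MySQL can re-encode the data in place. A sketch, assuming a hypothetical table named accidents:
-- converts the table default and re-encodes every character column
ALTER TABLE accidents CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;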

Related

MySQL to postgres migration issue

I want to migrate my project from MySQL to Postgres. I have one table in MySQL in which utf8mb4 is set for a particular column. What alternative is there in Postgres to set as a column encoding?
utf8mb4 is MySQL's way of representing four-byte UTF-8 characters; however, as the documentation clarifies:
Requires a maximum of four bytes per multibyte character.
So not all characters are actually stored in four bytes; four is just the maximum per character. You should therefore be able to migrate your utf8mb4 characters into a UTF-8 encoded target field (MySQL → PostgreSQL) without problems, at least in theory.
But you never know whether this fits practice, so it is advisable to first create a backup of your MySQL database (so you will not be afraid of making changes to it if for some reason you decide the initial database needs them). Export your database and modify your table/column definitions to no longer use utf8mb4 as an encoding: either leave it unspecified (if you can rely on PostgreSQL having UTF-8 as its default encoding) or specify a UTF-8 encoding explicitly, then run the inserts. Take a few samples of data from the original database and compare them to what PostgreSQL returns. If it works out of the box, then the theory fit the practice. If not, you will need to research the cause of the problem you experience.
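For what it's worth, PostgreSQL makes the encoding a property of the database rather than of individual columns, so the target side could be sketched like this (database and table names are hypothetical):
-- the encoding is declared once, for the whole database
CREATE DATABASE target_db ENCODING 'UTF8';
-- columns then use plain text types; there is no per-column encoding clause
CREATE TABLE articles (title varchar(255));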

Does MySQL server implicitly support encoding conversion?

I need to implement a sorted SELECT, on a specific encoding of a field, without CONVERT.
That is, normally I'd do it by
SELECT * FROM table ORDER BY CONVERT(field USING gbk) COLLATE gbk_chinese_ci
However for some reason CONVERT was not allowed. As a result, I tried to approach this by
ALTER TABLE table MODIFY field VARCHAR(xx) CHARACTER SET gbk COLLATE gbk_chinese_ci;
SELECT * FROM table ORDER BY field
It works. That's good. However I'm worried about encoding problems.
Connection to the MySQL server includes the parameters characterEncoding=utf8 and useUnicode=true. I couldn't yet find an explanation of these params in MySQL's official documentation, but I suppose they ensure that the communication between the client and the server is in utf-8.
That brings the question: does the MySQL server implicitly convert data from utf-8 to gbk when it receives it? Do the connection params only define the charset of the communication rather than that of the final stored data?
Edit
Comments say that the server does convert them! Thanks guys!
My further confusion is that, only one of the fields is set to use gbk, while everything else has been left to use utf8. That means the server's charset should still be utf8 globally but gbk locally for that field only.
Suppose now I fire this line of script to the server
INSERT INTO table (field_gbk, field_utf8) VALUES ("a", "b");
Does the server:
Receive the whole statement in utf8;
Convert only "a" to gbk and store it; and
Store "b" as-is in the database?
Many thanks guys!
Yes.
You specify the encoding of the client when you connect.
You specify the encoding ("Character set") of the column you are Inserting into.
MySQL converts from one encoding to the other as it INSERTs the rows. Similarly, it converts the other way when SELECTing.
The CONVERT function should not (normally) be used for anything.
You are using Java? characterEncoding=utf8 and useUnicode=true are what it uses for declaring the client side.
"gbk" for a single column? Find. That column will handled differently than other columns.

What should be the correct MySQL collation in this case?

I'm storing strings in a MySQL database.
Some of the strings have single quotes which then get stored like this:
People’s
Is this the proper way to store these strings, or should I set a different MySQL collation?
I have tried the following without luck....
utf8_general_ci
latin1_swedish_ci
Where are you setting the collation? You should be using UTF-8 in three places:
as the collation of each column that contains character data. You can set the default collation for the table or database so that new columns pick it up, but if you already have a table, ALTERing its default collation doesn't change the collation of the existing columns.
as the encoding of the connection between your application and MySQL. This can be set manually using the SET NAMES statement, or, better, with the suitable API call for your environment (for example mysql_set_charset() in PHP, or the charset argument to connect() in Python MySQLdb).
in your output. For example if producing a web page, by using the Content-Type: text/html;charset=utf-8 header/meta.
You can store the string "People’s" as UTF-8-hidden-in-Latin-1 "People’s" by using Latin-1 throughout, since you'll still get the same bytes out as you put in. But that way you won't get sensible results from ordering or case-insensitive comparisons of non-ASCII characters.
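A sketch of the first two points, assuming a hypothetical table people with a name column (as noted above, an API call like mysql_set_charset() is preferable to SET NAMES where available):
-- 1. per-column collation
ALTER TABLE people MODIFY name VARCHAR(100) CHARACTER SET utf8 COLLATE utf8_general_ci;
-- 2. connection encoding
SET NAMES utf8;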

mySQL Character Sets

I noticed today that our database uses character set "utf8 -- UTF-8 Unicode" and collation "utf8_general_ci", but most of the tables and columns inside use CHARSET=latin1. Will I run into any problems with this?
The reason I ask is that we have been running into a lot of problems syncing data between two databases.
For an overview of MySQL character sets, read for example http://mysqldump.azundris.com/archives/60-Handling-character-sets.html
The server, a schema/database, and a table have no character sets of their own; they just have defaults that are inherited downwards (server to schema to table). Columns of a CHAR, VARCHAR, or any TEXT type do have character sets, on a per-column basis. If no specific character set is defined for them, they inherit it from the table.
Inheritance for all these objects happens at object creation time.
The other thing that has a character set is the connection. Since the connection is the collection of things the server knows about the client, the character set of the connection should be set to whatever character set you are using in your client.
MySQL will then correctly convert between the character set of a column and the character set of a connection. Usually there are no problems with that.
The most common problem people have with it is lying to the server, that is, setting the character set of the connection to something different from what the client is actually sending or using. This can be done at runtime by sending the command SET NAMES ... as the first thing at connection setup, and it is very important that you specify the correct thing here.
If you do, and for example send latin1 data into a connection that has been SET NAMES latin1, storing data into a latin1 column will not convert data, whereas storing data into a utf8 column will convert your latin1 umlauts (ö = F6) into utf8 umlauts (ö = C3 B6) on disk. Reading will transparently convert back, if the connection is properly set up.
In your setup, if your connection is SET NAMES utf8 and you are sending data to a latin1 column, only data that can be represented in latin1 can be stored. There will be data truncation, and a data truncation warning, if you for example try to store Japanese hiragana in such a latin1 column.
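You can observe that truncation case directly. A sketch, with a hypothetical latin1 table t_latin1:
SET NAMES utf8;
-- txt is a latin1 column; hiragana cannot be represented in latin1
INSERT INTO t_latin1 (txt) VALUES ('ひらがな');
-- shows the resulting "Incorrect string value" / truncation warning
SHOW WARNINGS;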
In my experience, messing up MySQL character sets resulted in string sorting that was not 100% correct. You would be better off having everything in UTF-8 to be on the safe side.
I think it depends on what you actually store in those columns. If you store UTF-8 multi-byte characters in a column with a latin-1 charset, you might run into sorting troubles. But as long as there are only EN/US characters, you should be OK.
You will run into problems if there's a possibility of storing "international" text -- that is, non-latin characters.
If I understand what you're posting correctly, this means that the default for new tables in your database is UTF-8, but your existing tables use latin-1. That could be a problem. It depends on your data, as mentioned above.

SSIS - Data from MySQL to MSSQL, some characters become "?"

I just transferred some data from MySQL to MSSQL (2K5). In a text field, some of my characters, such as apostrophes, are now ? (question mark). To me this indicates some sort of collation or character set error, right?
To be honest, I don't know which one I should be using.
The MySQL db's current charset is utf8_general_ci and in MS SQL it is SQL_Latin1_General_CP1_CI_AS.
I have tried changing the charset of the MySQL table to latin1_swedish_ci, but this doesn't help.
Thanks for the input
Have you tried changing the target (SQL Server) column data type to NVARCHAR?
The utf8_general_ci collation on the MySQL column indicates a Unicode data type. If the source is Unicode, so should be the target - for the easiest transition.
Collations themselves play a minor role here. They just affect comparison and sorting.
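A sketch of that change on the SQL Server side (table and column names are hypothetical):
-- NVARCHAR stores Unicode, so apostrophes and other non-latin1 characters
-- survive the transfer intact
ALTER TABLE dbo.imported_data ALTER COLUMN description NVARCHAR(MAX);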
You might also need to check the SSIS type of the columns in your dataflow. Remember, the data type and character set are set at the connection manager on the source (and that may involve a conversion from the original native character set). Also, any operations like derived columns or conversions have a character set which can be altered and will persist down that column's lineage in the data flow. At the end, when it gets to the destination, there could be additional character set coercion/conversion.