run mysql without collation (utf-8 only)

I run an sqlite3 database with UTF-8 strings from many languages. For various reasons I want to move to MySQL, but I constantly run into trouble because of MySQL's collation feature.
One problem is that I am not even able to reliably know what is in my database. (For example, I get "?" for non-Latin characters and "�" for Latin-based characters like öé, etc., but I have absolutely no idea whether the problem lies in the import from sqlite3 to MySQL or in reading from the MySQL database.)
Is there a way to get rid of this "feature" and let MySQL do what I tell it without trying to be smart? I use UTF-8 everywhere and never need any mangling of strings: input is always UTF-8 and output should always be UTF-8. I would also really like to know what is actually stored in the database, i.e. without a collation feature corrupting the data during readout.

You could use the MySQL VARBINARY column type, which stores a sequence of arbitrary bytes without interpreting them in any particular charset (or maybe VARCHAR BINARY, which is subtly different).
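A minimal sketch (the table and column names are made up):
CREATE TABLE t_raw (
    id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    content VARBINARY(255)  -- raw bytes; no charset interpretation, comparison and sorting are byte-wise
);
For contrast, VARCHAR(255) BINARY is shorthand for the binary collation of the column's character set, so that column still carries a charset.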

MySQL defaults to latin1_swedish_ci unless you explicitly specify something different. That's the opposite of smart: you have to be smart and change that default yourself. This can be done with e.g. the --character-set-server and --collation-server command-line options. See Specifying Character Sets and Collations in the manual for other means and further options.
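For example (a sketch; whether you set this on the command line or in the config file depends on your setup):
mysqld --character-set-server=utf8 --collation-server=utf8_general_ci
or, permanently, in the [mysqld] section of my.cnf:
character-set-server = utf8
collation-server = utf8_general_ci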

Related

MySQL to postgres migration issue

I want to migrate my project from MySQL to Postgres. I have one table in MySQL in which utf8mb4 is set for a particular column. What is the Postgres alternative for setting an encoding on a column?
utf8mb4 is MySQL's way to represent 4-byte UTF-8 characters; however, as the documentation clarifies:
Requires a maximum of four bytes per multibyte character.
So not every character is actually stored in four bytes; MySQL does not use up all 4 available bytes for each character. You should therefore be able to migrate your utf8mb4 data into a UTF-8 encoded target field (MySQL to PostgreSQL) without problems, at least in theory.
But you never know whether this fits practice, so it is advisable to first create a backup of your MySQL database (so you will not be afraid of making changes to it if for some reason you decide the initial database needs adjusting). Then export your database and modify the table/column definitions so they no longer use utf8mb4 as an encoding: either leave the encoding unspecified (if you can rely on PostgreSQL having UTF-8 as the default encoding) or specify a UTF-8 encoding explicitly, and run the inserts. Take a few samples of data from the original database and compare them to what PostgreSQL returns for them. If it works out of the box, the theory fit the practice. If not, you will need to research the cause of the problem you experience.
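On the PostgreSQL side this might look like the following (a sketch; the database and table names are placeholders). Note that PostgreSQL sets the encoding per database rather than per column:
CREATE DATABASE target_db ENCODING 'UTF8';
-- inside target_db: a plain text column stores UTF-8; no per-column encoding exists
CREATE TABLE posts (body text);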

MySQL encoding weird characters

As many people have before, I have a problem in MySQL with the encoding of my data.
More specifically, the collation of the table seems to be utf8_general_ci. The data is inserted fine, but when a select is done, some characters get translated badly:
Marie-Thérèse becomes Marie-ThÃ©rÃ¨se.
Is it possible to do a select and translate these characters back to the original value, or is it impossible? It's harder to change the original table in my case, so I'd rather solve it in my select query.
When using phpMyAdmin (or the like) and looking at those entries, do they look okay?
Update: if not, the inserts are probably already flawed, and the connection from the insertion script must be adapted.
If they do look okay, then it's technically not MySQL's fault but the software connecting to it. See for example: UTF-8 all the way through. You have to set some parameters on/after opening the connection.
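For example, right after opening the connection (assuming your application really sends UTF-8):
SET NAMES utf8;
-- sets character_set_client, character_set_connection and character_set_results in one go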
Btw: the collation should be irrelevant here. http://dev.mysql.com/doc/refman/5.7/en/charset-general.html
The gist is: a collation tells you how to order/compare strings, which mainly matters for special characters like äöü in German or àéô in French, because a local/regional collation may say that ä sorts exactly like a (for example), while in another collation ä could sort distinctly after a, or even after z.
In the end it turned out the problem was with running it all through a cronjob.
We run a script through a cronjob that generates the insert statements. Apparently, when running the script manually everything goes well, but when running the same script through a cronjob the data got messed up. We solved that with the help of this article: http://www.logikdev.com/2010/02/02/locale-settings-for-your-cron-job/
We had to add a LANG variable in the /etc/environment file.
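For reference, the line in question (the exact locale value depends on your system):
LANG=en_US.UTF-8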

Mysql common ascii character collation

I am getting an error like this:
COLLATION 'latin1_swedish_ci' is not valid for CHARACTER SET 'utf8'
whenever I try to run a particular query. The problem in my case is that I need this query to be able to run, without modification, against two separate databases, which have different character collations (one is latin1, the other is utf8).
Since the strings I am trying to match are guaranteed to be basic letters (a-z), I was wondering if there is any way to force the comparison to work irrespective of the specific encoding?
I mean, an A is an A no matter how it is encoded; is there some way to tell MySQL to compare the content of the string as letters rather than as whatever binary thing it does internally? I don't even understand why it can't auto-convert collations, since it is quite capable of doing so when explicitly told to.
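One way to sidestep the mismatch (a sketch, not from the original thread; the table, column and value are placeholders) is to convert one side explicitly, so that both operands end up in the same character set:
SELECT * FROM t WHERE CONVERT(name USING utf8) = 'abc';
-- or force a specific collation once the character sets match:
SELECT * FROM t WHERE CONVERT(name USING utf8) COLLATE utf8_general_ci = 'abc';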

Databases: column encoding, when is it important?

We are importing data from a .sql script containing UTF-8 encoded data into a MySQL database:
mysql ... database_name < script.sql
Later this data is displayed on a page in our web application (connected to that database), again in UTF-8. But somewhere in the process something went wrong, because non-ASCII characters were displayed incorrectly.
Our first attempt to solve it was to change the MySQL columns' encoding to UTF-8 (as described, for example, here):
alter table wp_posts change post_content post_content LONGBLOB;
alter table wp_posts change post_content post_content LONGTEXT CHARACTER SET utf8;
But it didn't help.
Finally we solved this problem by importing data from the .sql script with an additional command line flag which, I believe, forced the mysql client to treat the data from the .sql script as UTF-8:
mysql ... --default-character-set=utf8 database_name < script.sql
It helped, but then we realized that this time we had forgotten to change the column encoding to utf8: it was still set to latin1, even though UTF-8 encoded data was flowing through the database (from SQL script to application).
So if data obtained from the database is displayed correctly even when the database character set is set incorrectly, then why the heck should I bother setting the correct database encoding?
In particular, I would like to know:
What parts of the database rely on the column encoding setting? When does this setting have any real meaning?
On what occasions is an implicit conversion of column encoding done?
How does the trick with converting a column to binary format and then to the destination encoding work (see the SQL snippet above)? I still don't get it.
Hope someone can help me clear things up...
The biggest reason, in my view, is that it breaks your DB consistency.
It happens way too often that you need to check data in the database. And if you cannot properly input UTF-8 strings coming from the web page into your MySQL CLI client, it's a pity;
if you need to use phpMyAdmin to administer your database through the "correct" web interface, then you're limiting yourself (might not be an issue though);
if you need to build a report on your data, then you're greatly limited in the number of possible choices, given that only the web produces the correct output for you;
if you need to deliver a partial database extract to a partner or external company for analysis, and the extract is messed up, it's a pity.
Now to your questions:
When you ask the database to ORDER BY some column of a string data type, the sorting rules take into account the encoding of the column, and internal transformations may be applied if different columns have different encodings. The same applies when you compare strings: encoding information is essential there. Encoding comes together with collation, although most people don't use that feature very often.
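For instance (a sketch; the names and collations are illustrative, and a COLLATE clause must belong to the column's character set):
SELECT name FROM users ORDER BY name;                          -- uses the column's own collation
SELECT name FROM users ORDER BY name COLLATE utf8_unicode_ci;  -- overrides it for this query only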
As mentioned, if you have a set of columns in different encodings, the database will implicitly convert the values to a common encoding, which nowadays is usually UTF-8. Implicit recoding of strings can also happen in client frameworks/libraries, depending on the client environment's encoding: typically data is recoded into the database's encoding when sent to the server, and back into the client's encoding when results are delivered.
Binary data has no notion of encoding; it's just a set of bytes. So when you convert to binary, you're telling the database to "forget" the encoding, although you keep the data unchanged. Later, you convert back to a string, enforcing the right encoding. This trick helps if you're sure that the data physically is in UTF-8 while, by some accident, a different encoding was specified.
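The same trick can be tried non-destructively in a query before altering the table (using the table from the snippet above):
SELECT CONVERT(CAST(post_content AS BINARY) USING utf8) FROM wp_posts;
-- reads the raw bytes and reinterprets them as utf8 without touching the stored data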
Given that you only managed to load the data into the database by using --default-character-set=utf8, something was off with your environment; I suspect it was not set up for UTF-8.
I think the best practice today would be to:
have all your environments UTF8-ready, including shells;
have all your databases defaulting to UTF8 encoding.
This way you'll leave less room for errors.
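Concretely, the second point might look like this (a sketch; the database name is a placeholder, and utf8mb4 is the variant to pick if you need the full 4-byte range discussed above):
CREATE DATABASE app CHARACTER SET utf8 COLLATE utf8_general_ci;
-- or, to cover 4-byte characters as well:
CREATE DATABASE app CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;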

MySQL update error when special characters are used

I was wondering if anyone had come across this one before. I have a customer who uses special characters in their product description field. Updating to a MySQL database works fine if we use their HTML equivalents, but it fails if the character itself is used (copied from either Character Map or Word, I would assume).
Has anyone seen this behaviour before? The character in question in this case is ø - and we can't seem to do a replace on it (in ASP at least) as the character comes through to the SQL string as a "?".
Any suggestions much appreciated - thanks!
This suggests a mismatched character set between your database (connection) and actual data.
Most likely, you're using ISO-8859-1 on your site, but MySQL thinks it should be getting UTF-8.
http://dev.mysql.com/doc/refman/5.0/en/charset-connection.html describes what to check and how to change it. The simplest way is probably to run the query "SET NAMES latin1" when connecting to the database (assuming that's the character set you need).
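To see what the server currently assumes, and to override it for the session (latin1 being the assumption from above):
SHOW VARIABLES LIKE 'character_set%';
SET NAMES latin1;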
Being a fan of Unicode, I'd suggest switching over to UTF-8 entirely, but I realize that this is not always a feasible option.
Edit: @markokocic: Collation only dictates the sorting order. Although it should of course match your character set, it does not affect the range of characters that can be stored in a field.
Have you tried to set the collation for the table to utf-8 or something non-latin1/ASCII?