How to import SQL database if encoding is unknown - mysql

I have a database.sql file and I need to deploy it on my MySQLServer.
So I open the workbench and I open the file and the query appears.
This database has translations in lots of languages and some of them are ruined:
And some words in Chineese, Urdu, Indian etc show like this:
I'm not used to encodings but I don't know how to import it to restore that data to the original charset. Thank you un advance.

That's Mojibake. Without knowing the steps that the data went through, I cannot diagnose it completely. Several things needed to say utf8mb4 (or at least utf8). See this for what to check on and 'best practice': Trouble with UTF-8 characters; what I see is not what I stored

Related

How to get mysqldump to export a database in the right encoding and collation?

I'm having a problem with the encoding when dumping a database using mysqldump.
The issue is that the file being generated is breaking non-ASCII characters (for ex. german and spanish characters). The data in the DB is right, but it is exported wrong.
I have tried the following:
using --default-character-set to utf8, utf8mb4, and latin1 (the last option because although the tables are using utf8_general_ci collation, the database itself is set to latin1, I don't know why). Weirdly enough, the output differs in filesize, but the content (specially the problematic characters) shows the same issue in all three cases. As if the option would be ignored.
importing the dumped file into a new mysql service, but since the characters are broken in the file, the import is also broken. for ex. the dump with the utf8mb4 option is imported in a fresh database with character encoding utf8mb4, but since the source file is wrongly encoded, it is not being "transcoded back" to a right form.
Initially I thought that it could be an issue with the version of the mysql server being different (5.7 in the source, 8.0 in the destination server), but since the file seems to be already broken, I now think that this might not be the root-cause. Still lost, so I prefer to mention it just in case it helps.
An example of the sentence I'm running:
mysqldump --default-character-set=utf8mb4 --no-tablespaces -u database_user -p database_name > /home/username/database_name-utf8mb4-20220712.sql
No errors appear neither during the export nor during the import in the new server. Everything seems to run smooth, but the character encoding is messed up, so something isn't OK.
Any support is much appreciated. Thank you!
but the character encoding is messed up
Give us an example. Include a hex dump of a small portion of the file where garbage shows up.
It is likely that the original data was either in character set utf8 or latin1, but the dumping and/or reloading specified the wrong character set. Please provide more details of the dump and load.
Also see: Trouble with UTF-8 characters; what I see is not what I stored

mysqldump routines charset encoding problem

How could I possibly dump MySQL database structure and data on windows with mysqldump with polish characters ("ęóąśłżźćń") included?
So far I've managed to dump it altogether using mysqldump.exe <my_settings> --default-character-set=cp1250. It appears to solve at least my data inserts encoding problem since I've set it to cp1250 (Windows Central European) instead of latin2.
The problematic phrases are within my db structure code. For instance: all my stored procedures and functions contain these "special" characters in their comments. I believe, for some reason, they are interpreted as utf8 instead of cp1250. No matter what encoding I set, my comments stay intact.
I believe there must be some other separate setting for routines charset encoding I'd missed. I know it's possible to achieve since I dumped it with workbench data export and somehow it worked. Sadly I wasn't able to check cnf file content since it disappears right afterwards.
Any help would be much appreciated. Especially one excluding potential script conversions.
Cheers
Drop the stored routines, SET NAMES to the desired charset, re-CREATE the routines.
Confirm with SHOW CREATE PROCEDURE name and look at the charset given at the end.

UTF-8 encoding problem while importing a sql file

I have a server hosting MySQL, PHPMyAdmin reports:
Server version: 5.1.56-community
MySQL charset: UTF-8 Unicode (utf8)
I export a sql from using either mysqldump -uroot -p database > file.dump or mysqldump -uroot -p database -r file.dump (both generated files are identical anyway).
Locally, I installed MySQL 5.5 and HeidiSQL 9.5.
As the server's SQL file my.ini has:
default-character-set=utf8
I changed the local my.ini file to have
default-character-set=utf8
But also:
character-set-server=utf8
They were both set to latin1. Dunno why I have character-set-server set here while the server does not. Anyway.
Now I start HeidiSQL, it shows utf8mb4 references instead of utf8 for the sessions parameters. I don't know why:
Now, I import my dumped file, and I see that even if everything is apparently configured in utf8, it looks like I have some encoding problems.
On the server, I see:
Locally, in HeidiSQL, I see:
Special characters like à are not displayed correctly on the local database.
Am I doing something wrong?
Note that if I install HeidiSQL on the server, the variable tab shows the same values for the Session and Global parameters, and the à is shown correctly.
So this may be the root cause of the problem, but I don't know how to fix it. If I change the Session values before importing the sql file it does not fix the issue, and also values are back to utf8mb4 when I start HeidiSQL again.
Thanks to deceze comment, I could fix the issue.
In HeidiSQL, when I choose the sql file to execute, there's actually an "ncoding" option I did not notice originally ;-)
If I keep "auto-detect", the import generates bad content (with mojibake characters)
If I force "UTF-8", the import is perfect
Dunno why HeidiSQL fails to auto-detect the encoding...
A few thoughts:
It looks like you have the character set set correctly. The fact that HeidiSQL displays a different character set, is probably because clients themselves set a character set.
For example, your mysql server might use "Character set A" by default. If a client connects and says they want "Character set B", the server will convert this on the fly.
utf8mb4 is a superset (and superior to) utf8. It's better to have your server default to utf8mb4. The popular usecase of utf8mb4 is emoji.
Anyway, the reason you are getting mojibake is probably unrelated to having these character sets set correctly.
What I think may have happened is as follows (this is a guess).
Your tables/columns were set as UTF-8.
A client connects and tells the server "I want to use ISO-8559-1/latin instead".
The server happily complies and will convert the clients ISO-8559-1 strings to UTF-8 on the fly.
Despite the client wanting to use ISO-8559-1, it actually sends UTF-8.
The server thinks the data is ISO-8559-1 and treats it as such, and converts the UTF-8 using a ISO-8559-1 to UTF. It's effectively a double-encoding.
If I'm right, it means that you can have all your columns, connections and tables set to UTF-8, but your data is simply bad.
If this is correct, this process is reversable
You really just need the opposite operation. For example, if you had a PHP string $data, which is 'double-encoded' as UTF-8, the process would simply be to call this:
$output = utf8_decode($input)
It's also possible to fix this in MySQL. See this stack overflow question.
A few things to be aware of:
Make sure this is actually the case. Are you getting the correct output after this operation?
Make backups, obviously.
Also make absolutely sure that whatever was writing double-encoded UTF-8 to your database is now fixed. The last thing you want is a table that's a mixture of different encodings.
Sidenote: This problem is extremely common. You are somewhat lucky that you're french because it highlights the problem. Many english systems I've seen have this issue but it largely goes unnoticed for a long time because a lot of text doesn't go outside the common ASCII range.
You have "Mojibake". à turns into à (there are two characters, the second is a space).
This is caused when latin1 is involved somewhere in the process. The SESSION and GLOBAL settings are not at fault. Let's see SHOW CREATE TABLE.
See Mojibake in Trouble with UTF-8 characters; what I see is not what I stored for the likely causes. It may involve "Double Encoding"; let's see SELECT col, HEX(col) ....
As for fixing the data -- It depends on whether you have simply Mojibake or Double Encoding. See http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases for both.

Liferay does not display UTF8 characters anymore

I just made a database restore from mysql workbench and found out that liferay does not display UTF-8 spec characters e.g ÅÄÖ, these letters are instead displayed as a question mark.
I wonder if anyone knows the solution for this issue? Do I have to specify a charset while importing the sql files? And if so how do I do that in mysql workbench?
To be honest I have no idea if the mysql restore has a direct effect on what happened, I'm just describing what I did before the issue occurred.
If you do a restore to a new database, make sure that this database defaults the character set to UTF-8:
create database lportal character set utf8;
Then import your data into that table.
Let me also use this opportunity to link my favourite site to generate great UTF-8 testdata: http://www.fliptitle.com - great if you need testdata for people who know only ASCII languages but still need immediate feedback on correct encoding with data they're able to interpret. You don't seem to be one of them, but I guess others that are in this group might stumble upon this later.

Databases: column encoding, when is it important?

We are importing data from .sql script containing UTF-8 encoded data to MySQL database:
mysql ... database_name < script.sql
Later this data is being displayed on page in our web application (connected to that database), again in UTF-8. But somewhere in the process something went wrong, because non-ascii characters was displayed incorrectly.
Our first attempt to solve it was to change mysql columns encoding to UTF-8 (as described for example here):
alter table wp_posts change post_content post_content LONGBLOB;`
alter table wp_posts change post_content post_content LONGTEXT CHARACTER SET utf8;
But it didn't helped.
Finally we solved this problem by importing data from .sql script with additional command line flag which as I believe forced mysql client to treat data from .sql script as UTF-8.
mysql ... --default-character-set=utf8 database_name < script.sql
It helped but then we realized that this time we forgot to change column encoding to utf8 - it was set to latin1 even if utf-8 encoded data was flowing through database (from sql script to application).
So if data obtained from database is displayed correctly even if database character set is set incorrectly, then why the heck should I bother setting correct database encoding?
Especially I would like to know:
What parts of database rely on column encoding setting? When this setting has any real meaning?
On what occasions implicit conversion of column encoding is done?
How does trick with converting column to binary format and then to the destination encoding work (see: sql code snippet above)? I still don't get it.
Hope someone help me to clear things up...
The biggest reason, in my view, is that it breaks your DB consistency.
it happens way to often that you need to check data in the database. And if you cannot properly input UTF-8 strings coming from the web page to your MySQL CLI client, it's a pity;
if you need to use phpMyAdmin to administer your database through the “correct” web, then you're limiting yourself (might not be an issue though);
if you need to build a report on your data, then you're greatly limited by the number of possible choices, given only web is producing your the correct output;
if you need to deliver a partial database extract to your partner or external company for analysis, and extract is messed up — it's a pity.
Now to your questions:
When you ask database to ORDER BY some column of string data type, then sorting rules takes into account the encoding of your column, as some internal trasformation are applicable in case you have different encodings for different columns. Same applies if you're trying to compare strings, encoding information is essential here. Encoding comes together with collation, although most people don't use this feature so often.
As mentioned, if you have any set of columns in different encodings, database will choose to implicitly convert values to a common encoding, which is UTF8 nowadays. Strings' implicit encoding might be done in the client frameworks/libraries, depending on the client's environment encoding. Typically data is recoded into the database's encoding when sent to the server and back into client's encoding when results are delivered.
Binary data has no notion of encoding, it's just a set of bytes. So when you convert to binary, you're telling database to “forget” encoding, although you keep data without changes. Later, you convert to the string enforcing the right encoding. This trick helps if you're sure that data physically is in UTF-8, while by some accident a different encoding was specified.
Given that you've managed to load in data into the database by using --default-character-set=utf8 then there was something to do with your environment, I suggest it was not UTF8 setup.
I think the best practice today would be to:
have all your environments being UTF8 ready, including shells;
have all your databases defaulting to UTF8 encoding.
This way you'll have less field for errors.