MySQL to PostgreSQL migration issue

I want to migrate my project from MySQL to PostgreSQL. One of my MySQL tables has utf8mb4 set for a particular column. What is the PostgreSQL alternative for setting the encoding on a column?

utf8mb4 is MySQL's way to represent 4-byte UTF-8 characters; however, as the documentation clarifies:
Requires a maximum of four bytes per multibyte character.
So not all characters are actually stored in four bytes: utf8mb4 is a variable-length encoding that uses up to four bytes per character. You should therefore be able to migrate your utf8mb4 data into a UTF-8 encoded target (MySQL to PostgreSQL) without problems, at least in theory.
But you never know whether theory fits practice, so it is advisable to proceed carefully. First create a backup of your MySQL database, so you can make changes to the original without fear. Then export your database and modify the table/column definition so it no longer specifies utf8mb4 as an encoding: either leave the encoding unspecified (if you can rely on UTF-8 being the default encoding on the PostgreSQL side) or specify UTF-8 explicitly, and run the inserts. Finally, take a few samples of data from the original database and compare them to what PostgreSQL returns. If it works out of the box, theory fit practice. If not, you will need to research the cause of the problem you experience.
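Note that PostgreSQL sets the encoding at the database level rather than per column, so there is nothing column-specific to carry over. A minimal sketch, where the database and table names are illustrative:

CREATE DATABASE myapp ENCODING 'UTF8' TEMPLATE template0;
CREATE TABLE posts (
    id   serial PRIMARY KEY,
    body text  -- no per-column character set; the database's UTF-8 encoding applies
);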

Related

MySQL Collation in Kamailio tables

For some reason I don't know, I have some Kamailio tables with the utf32_general_ci collation (location, for example) and others with latin1_swedish_ci.
I tried to find an answer in the GitHub repo (https://github.com/kamailio/kamailio/blob/4.4/scripts/mysql/my_create.sql), but the collation is not specified there.
Is that right? Does it matter? Which one is the right choice?
My Kamailio version is 4.4.3.
What is Kamailio? What does it store in a database?
In general, all software today should be using the full UTF-8 character set encoding. In MySQL, the syntax is CHARACTER SET utf8mb4.
You mentioned two COLLATIONs; they have to do with sort order, not character encoding. The two you mentioned correspond to these character sets:
utf32 -- 4 bytes per character; quite wasteful of space. Virtually no one uses it, and there is no good reason (that I can think of) to use it today.
latin1 -- 1 byte per character; handles western Europe but none of Asia. It is an old default in MySQL (hence your tables having it).
I do not know the ramifications of trying to change Kamailio.
One thing to note about how MySQL deals with character set differences: you must specify the encoding the client uses when connecting to MySQL. Then, when INSERTing and SELECTing, MySQL will convert (if necessary, and if possible) between the client encoding and the column encoding. Because of this automatic conversion, you will currently see nothing wrong for western European characters; but Greek, Chinese, etc. will be mangled or lost when stored into latin1.
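As an illustration, declaring the client encoding for a session looks roughly like this (a sketch; utf8mb4 is assumed as the desired character set):

SHOW VARIABLES LIKE 'character_set%';  -- inspect the session's current settings
SET NAMES utf8mb4;                     -- declare that this client sends and expects utf8mb4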

Run MySQL without collation (UTF-8 only)

I run an sqlite3 database with UTF-8 strings from many languages. For various reasons I want to move to MySQL, but I constantly run into trouble because of the MySQL collation feature.
One problem is that I am not even able to reliably know what is in my database. (For example, I get "?" for non-Latin characters and "�" for Latin-based characters like öé. But I have absolutely no idea whether the problem lies in the import from sqlite3 to MySQL or in reading from the MySQL database.)
Is there a way to get rid of this "feature" and let MySQL do what I tell it without trying to be smart? I use UTF-8 everywhere and never need any mangling of strings: input is always UTF-8 and output should always be UTF-8. I would also really like to know what is actually stored in the database, i.e. without a collation feature corrupting the data during readout.
You could use the MySQL VARBINARY column type, which stores a sequence of arbitrary bytes without interpreting them in any particular charset (or maybe VARCHAR BINARY, which is subtly different).
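A minimal sketch of such a table (names are illustrative):

CREATE TABLE messages (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    body VARBINARY(1024)  -- raw bytes, stored and returned without any charset conversion
);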
MySQL uses latin1_swedish_ci unless you explicitly specify something different. That's the opposite of smart: you have to be smart and change that default. This can be done with, e.g., the --character-set-server and --collation-server command line options. See Specifying Character Sets and Collations for other means and further options.
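For instance (a sketch; utf8 is assumed as the desired server default):

# command-line form:
mysqld --character-set-server=utf8 --collation-server=utf8_general_ci
# or persistently in my.cnf:
[mysqld]
character-set-server = utf8
collation-server     = utf8_general_ci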

Migrating MySQL DBs from one host to another: encoding issues

I am migrating a very large number of MySQL DBs from a few shared web hosts to one shared web host.
The majority of these are Portuguese, so there are quite a few special characters. Some of the DBs I am migrating are in latin1, some are cp1251, some are utf8.
Of course, simply dumping the DBs and then restoring the dumps onto the new host completely botches the encoding, and "?" characters and other nonsense show up on the actual websites associated with the databases.
On a small scale it would be acceptable to muck about with the HTML charset tags to know what to dump/restore as, but the problem is that we're dealing with thousands of databases and websites, and the migrations are all done automatically via several scripts.
I'm looking for suggestions on the best way of dumping/restoring these DBs, given that the script doing the work will not know the encoding specified in the HTML tags.
So far I have tried using the actual mysqldump tool, as well as mimicking it with a PHP script that dumps to and from memory instead of a text file; neither of these seems to replicate the data perfectly without encoding issues.
Should I be using UTF-8 to encode the dump, then restoring as-is regardless of the HTML codepage?
Dumping and restoring both in UTF-8 regardless of the HTML codepage?
Dumping and restoring in the default charset found in each CREATE TABLE statement?
My understanding of the implications of these different scenarios is limited, but what I need to know is basically whether there is a way to perfectly replicate data between two database servers, without encoding issues, and without knowing the codepage used by the HTML of the script accessing the data.
Encodings are a very difficult problem to tackle, especially when moving databases. First do a structural import, and then compare the new structure exactly with the old one, taking special care with the database character set, the table default character sets and the column character sets. You can get this information very easily from the information_schema database.
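For example, a query along these lines lists the column-level settings (the schema name is illustrative):

SELECT table_name, column_name, character_set_name, collation_name
FROM information_schema.columns
WHERE table_schema = 'mydb'
  AND character_set_name IS NOT NULL;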
Once those are absolutely mirrored, you can begin the import. However, beware that you can hold characters of one encoding in columns declared with a different one: it is quite common to find valid UTF-8 byte sequences in a latin1 column (latin1 is a 1-byte character set, while MySQL's utf8 uses up to 3 bytes per character).
You can try various methods to convert the dumps, but as far as I know there is no 100% reliable method for such cases of mixed encodings in the same column; eventually you might need to do some manual cleanup. Hopefully the first approach will suffice and everything will be fine.
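As a starting point, pinning the connection character set on both sides of the transfer looks roughly like this (a sketch; the database name is illustrative):

mysqldump --default-character-set=utf8 mydb > mydb.sql
mysql --default-character-set=utf8 mydb < mydb.sql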

Databases: column encoding, when is it important?

We are importing data from a .sql script containing UTF-8 encoded data into a MySQL database:
mysql ... database_name < script.sql
Later this data is displayed on a page in our web application (connected to that database), again in UTF-8. But somewhere in the process something went wrong, because non-ASCII characters were displayed incorrectly.
Our first attempt to solve it was to change the MySQL columns' encoding to UTF-8 (as described, for example, here):
alter table wp_posts change post_content post_content LONGBLOB;
alter table wp_posts change post_content post_content LONGTEXT CHARACTER SET utf8;
But it didn't help.
Finally we solved the problem by importing the data from the .sql script with an additional command-line flag, which I believe forced the mysql client to treat the data from the script as UTF-8:
mysql ... --default-character-set=utf8 database_name < script.sql
It helped, but then we realized that this time we had forgotten to change the column encoding to utf8: it was still set to latin1, even though UTF-8 encoded data was flowing through the database (from the SQL script to the application).
So if data obtained from the database is displayed correctly even though the column encoding is set incorrectly, why the heck should I bother setting the correct encoding at all?
In particular, I would like to know:
What parts of the database rely on the column encoding setting? When does this setting have any real meaning?
On what occasions is an implicit conversion of column encoding done?
How does the trick with converting a column to binary format and then to the destination encoding work (see the SQL snippet above)? I still don't get it.
I hope someone can help me clear things up...
The biggest reason, in my view, is that it breaks your DB consistency.
It happens way too often that you need to check data in the database directly, and if you cannot properly input UTF-8 strings coming from the web page into your MySQL CLI client, that's a pity;
if you have to administer your database through the "correct" web tool such as phpMyAdmin, then you're limiting yourself (might not be an issue, though);
if you need to build a report on your data, then you're greatly limited in your choice of tools, given that only the web path produces the correct output;
if you need to deliver a partial database extract to a partner or external company for analysis, and the extract is messed up, that's a pity.
Now to your questions:
When you ask the database to ORDER BY some column of a string data type, the sorting rules take into account the encoding of the column, and some internal transformations are applied when different columns use different encodings. The same applies when you compare strings: encoding information is essential there. Encoding comes together with collation, although most people don't use this feature very often.
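To illustrate, the collation in force directly changes comparison results (a sketch):

SELECT _utf8 'a' = _utf8 'A' COLLATE utf8_general_ci;  -- 1: case-insensitive match
SELECT _utf8 'a' = _utf8 'A' COLLATE utf8_bin;         -- 0: byte-wise comparison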
As mentioned, if you have a set of columns in different encodings, the database will implicitly convert values to a common encoding, which nowadays is usually UTF-8. Implicit recoding of strings can also happen in client frameworks/libraries, depending on the client environment's encoding. Typically data is recoded into the database's encoding when sent to the server, and back into the client's encoding when results are delivered.
Binary data has no notion of encoding; it's just a set of bytes. So when you convert to binary, you're telling the database to "forget" the encoding while keeping the data unchanged. Later, you convert to a string type, enforcing the right encoding. This trick helps when you're sure the data physically is UTF-8 while, by some accident, a different encoding was specified.
Given that you managed to load the data into the database by using --default-character-set=utf8, something in your environment was off; I suspect it was not set up for UTF-8.
I think the best practice today would be to:
have all your environments be UTF-8 ready, including shells;
have all your databases default to UTF-8 encoding.
This way you'll leave less room for errors.
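For example, a database-level UTF-8 default can be set at creation time (a sketch; the name is illustrative):

CREATE DATABASE mydb CHARACTER SET utf8 COLLATE utf8_general_ci;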

Unicode Comparing in PHP/MySQL

The name Accîdent seems to be different from AccÎdent when I do a database query to update the column. Yet Accîdent and AccÎdent point to the same place...
In MySQL Accîdent = Accîdent when inserted.
Also, AccÎdent = AccÃŽdent.
Do you know why this is?
By default, MySQL assumes the client uses the latin1 character set. If you're using UTF-8 in your PHP scripts, then this assumption is false. You need to specify to MySQL that you're using UTF-8 by issuing this SQL statement just after the database connection is opened:
SET NAMES utf8
Then the data inserted by subsequent SQL statements will use the correct character set. This means that you need to re-insert your data or follow the MySQL conversion procedure (see the last paragraphs).
It is recommended that your tables are configured to store data in UTF-8, too, to avoid unnecessary read/write character set conversions. That's not required, though.
More information is available in the MySQL documentation. Specifically, Connection Character Sets and Collations.
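The table conversion itself is roughly this (a sketch; the table name is illustrative):

ALTER TABLE posts CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;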
First, you seem to be storing UTF-8 data in a table with a different encoding. MySQL will try to cope, but the side effect is what you see: data in the database looks "weird". When creating a table, you need to specify the character encoding, preferably UTF-8. For existing tables, you'll need to convert the data.
Second, tables have a "collation" besides the encoding. Encoding determines how characters map to bytes; collation determines sorting and comparison. There are language-specific collations, but utf8_general_ci should be the one you're looking for ("ci" stands for "case insensitive"): then your two strings would match.
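As a quick check, once both the connection and the column actually use UTF-8, a case-insensitive collation makes the two spellings compare equal (a sketch):

SELECT _utf8 'Accîdent' = _utf8 'AccÎdent' COLLATE utf8_general_ci;  -- 1: î and Î are case variants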