How to solve a collation error in MySQL?

I migrated a Microsoft SQL Server database to MySQL and ran into collation problems in the rows in MySQL. I tried changing the collation, but the errors are still there. The data is going to be used in WordPress, so I tried the Database Collation Fix plugin, but it doesn't work.
The affected table is wp_posts, in the post_title and post_content columns. Every character that carries an accent, as well as the Spanish 'ñ', is replaced by a random character.
I already tried utf8_spanish_ci and utf8mb4_spanish_ci.
Any suggestions?
Microsoft SQL database collation: Modern_Spanish_CI_AI
MySQL database collation: UTF8 Default Collation
Thanks

I don't know if this helps you, but the collating orders in MySQL's Modern Spanish utf8_spanish_ci and/or utf8mb4_spanish_ci collations are different from those in utf8_unicode_ci and/or utf8mb4_unicode_ci.
Modern Spanish collation handles N and Ñ as separate characters, with Ñ coming directly after N. Generic latin-language collation treats them as variants of the same character. So, if you want Spanish collation -- that is, if you're dealing with lots of proper names and so forth -- you'll need to use the Spanish collation for this data.

If ñ turned into ?, you have one type of problem.
If ñ turned into Ã±, you have "Mojibake".
If ñ turned into �, it's yet another problem.
Please be more specific, since the solutions are quite different.
The question "Trouble with utf8 characters; what I see is not what I stored" provides information on the common issues.
The "Collation" is not relevant to ñ being replaced by a 'random character'. Only the CHARACTER SET is relevant.
When you get into comparing or sorting strings, then the COLLATION becomes relevant. I think the only difference between ..._spanish_ci and ..._spanish2_ci is the handling of ch and ll.
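To make the three symptoms above concrete, here is a minimal sketch of how each one can arise. It is an illustration only: Python's codecs stand in for MySQL's character set handling, and cp1252 is assumed as the "wrong" single-byte encoding.

```python
# Illustration only: Python codecs stand in for MySQL's charset handling.
original = "ñ"

# 1. Mojibake: the UTF-8 bytes of ñ are read back as if they were cp1252.
utf8_bytes = original.encode("utf-8")      # b'\xc3\xb1'
mojibake = utf8_bytes.decode("cp1252")     # 'Ã±'

# 2. Question mark: the target character set has no ñ at all (ASCII here).
question_mark = original.encode("ascii", errors="replace").decode("ascii")

# 3. Replacement character: a single-byte latin1 ñ is read as if it were UTF-8.
latin1_bytes = original.encode("latin-1")  # b'\xf1'
black_diamond = latin1_bytes.decode("utf-8", errors="replace")

print(repr(mojibake), repr(question_mark), repr(black_diamond))
```

None of these steps involves sorting or comparing, which is why changing the collation alone does not bring the ñ back.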

Related

What COLLATE should I set to support all possible languages?

I have a column called username. I want users to be able to insert text in Japanese, Roman, Arabic, Korean, and every other possible script, including special characters [https://en.wiktionary.org/wiki/Index:All_languages]. What COLLATE should I set on my database and tables?
I'm using utf_general_ci. I'm new, so I don't know if this is the best COLLATE for my needs. I need to choose the right COLLATE to avoid SQL errors, because I will not use preg_replace or any other function to replace special characters; I will only use prepared statements to avoid SQL injection and protect the database.
First choice (MySQL 8.0): utf8mb4_0900_ai_ci
Second choice (as of 5.6): utf8mb4_unicode_520_ci
Third choice (5.5+): utf8mb4_unicode_ci
Before 5.5, you can't handle all of Chinese, nor Emoji: utf8_unicode_ci
The numbers refer to Unicode standards 9.0, 5.2, and (no number) 4.0.
No collation is good for sorting all languages at the same time. Spanish, German, Turkish, etc, have quirks that are incompatible. The collations above are the 'best' general purpose ones available.
utf8mb4 handles all characters yet specified by Unicode (including Cherokee, Klingon, Cuneiform, Byzantine, etc.)
If Portuguese is the focus:
See https://pt.stackoverflow.com/ and MySQL collation for Portuguese.
Study this for 8.0 or this for pre 8.0 to see which utf8/utf8mb4 collation comes closest to sorting Portuguese 'correctly'. Perhaps utf8mb4_danish_ci or utf8mb4_de_pb_0900_ai_ci would be best.
(Else go with the 'choices' listed above.)
If you are using MySQL 5.5.3 or higher, I would recommend the utf8mb4 character set with the utf8mb4_unicode_ci collation. AFAIK it supports most, if not all, languages and implements the Unicode standard for sorting and comparison. As a second choice, have a look at utf8mb4_general_ci, which may be faster but is also less accurate.
See this excellent SO post for (many) more details, or check out the official MySQL doc.
Below 5.5.3, utf8_unicode_ci is your friend.
COLLATION refers to ordering (as in comparisons in WHERE and ORDER BY); you should really ask about CHARACTER SET:
Pre-5.5.3: utf8 (aka utf8mb3) handles all languages, except for a few Chinese characters and Emoji.
5.5.3 forward: utf8mb4 - Handles everything. Outside of MySQL, it is spelled "UTF-8".
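As a rough way to see where the utf8 (utf8mb3) limit falls, the sketch below checks whether a string contains any character that needs 4 bytes in UTF-8. The helper name needs_utf8mb4 is made up for this illustration; it only mirrors the 3-byte versus 4-byte distinction described above and does not talk to MySQL.

```python
def needs_utf8mb4(text: str) -> bool:
    """Return True if any character takes 4 bytes in UTF-8 (outside the BMP)."""
    return any(len(ch.encode("utf-8")) == 4 for ch in text)


# Accented Latin letters and common CJK text fit in 3 bytes or fewer.
print(needs_utf8mb4("ñ"))
print(needs_utf8mb4("日本語"))
# Emoji and some rarer Chinese characters need 4 bytes.
print(needs_utf8mb4("\U0001F600"))   # an emoji
print(needs_utf8mb4("\U0002070E"))   # a CJK Extension B character
```

If this returns True for data you expect to store, a utf8 (utf8mb3) column cannot hold it and utf8mb4 is needed.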

Which collation to use so that `ş` and `s` are treated as unique values?

The issue is that ş and s are interpreted by MySQL as identical values.
I'm new to MySQL, so I have no idea which collations would view them as unique.
The collations that I've tried using which don't work are:
utf8_general_ci
utf8_unicode_520_ci
utf8mb4_unicode_ci
utf8mb4_unicode_520_ci
Does anybody know which collation to use?
P.S. I also really need the collation to interpret emojis and other non-Latin characters, and, to my knowledge of MySQL and collations, the only collation able to do this is unicode?
utf8_turkish_ci and utf8_romanian_ci -- as shown in http://mysql.rjweb.org/utf8_collations.html
(Plus, of course, utf8_bin.)
For your added question: You are looking for a "character set" (not a "collation") that can represent Emoji and other non-Latin characters -- UTF-8 is the one to use. In MySQL, it is utf8mb4. The "collations" that are associated with that are named utf8mb4_.... Collations control ordering and equality, as indicated in the first part of your question about s and ş.
MySQL's CHARACTER SET utf8 is a subset of utf8mb4. Either can handle all the "letters" in the world. But only utf8mb4 can handle Emoji and some Chinese characters.
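To illustrate why most _ci collations report ş = s, here is a small sketch that builds an accent- and case-insensitive comparison key. This is only an approximation for illustration: MySQL's collations use their own weight tables, and accent_insensitive_key is a name invented here.

```python
import unicodedata


def accent_insensitive_key(text: str) -> str:
    """Strip combining marks and fold case, roughly like an accent-insensitive _ci collation."""
    decomposed = unicodedata.normalize("NFD", text)  # ş becomes s + combining cedilla
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


# Under the accent-insensitive key the two letters collide...
print(accent_insensitive_key("ş") == accent_insensitive_key("s"))
# ...while a byte-for-byte comparison (what a _bin collation does) keeps them apart.
print("ş".encode("utf-8") == "s".encode("utf-8"))
```

A collation tailored for a language in which ş is a letter of its own, such as the Turkish or Romanian ones mentioned above, skips this kind of folding for that letter.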

MySQL Swedish Collation and 'é'

I've got a MySQL database that stores Swedish characters (not part of the PK, though) and does selects on those characters.
I don't have a ton of experience with this kind of stuff, but I had previously set the collation to "utf16_swedish_ci", which seems to have worked fine for a long time and was able to differentiate the similar characters (like ä vs a and é vs e) in select statements.
Lately, though, I noticed that using that collation seems to always consider é and e the same (though it seems to distinguish all of the other similar Swedish characters fine).
Did something change regarding that in newer versions of MySQL? Or should that have always been the case and I just didn't notice it until now? What collation should I be using to uniquely identify all the Swedish characters that won't have any weird side effects?
Thanks in advance!
å, ä and ö are part of the native Swedish alphabet, and don't need any special treatment. However, é is not native, and relies on accent rules for collation.
As far as I am aware, to get accent sensitive collation in MySQL, you need to use one of the binary collations - eg utf16_bin, which unfortunately also is case sensitive.
What version of MySQL are you using, and have you recently updated to a newer version? If you have, then rolling back to a previous version could solve your collation issues. I know there were some changes to the collations included in version 8.x.x, so maybe that is what you are experiencing.
In most (including swedish_ci) utf8 or utf8mb4 collations E=é. Exceptions: _bin and _icelandic_ci. See http://mysql.rjweb.org/utf8_collations.html and http://mysql.rjweb.org/utf8mb4_collations.html
Note that most collations end with _ci, which implies both case folding and (mostly) ignoring accents.
Do not use utf16 or utf32; use only utf8/utf8mb4.
Before 8.0, MySQL had no collations that treat case and accent differently. MySQL 8.0 added some, such as utf8mb4_0900_as_ci (accent-sensitive, case-insensitive).
The only incompatible change in collations has been in 5.0 with the German ß. It was a fiasco; MySQL will never change a collation again -- though it may add new collations.

What's the difference between utf8_general_ci and utf8_unicode_ci in MySQL?

For a while now, I've used phpMyAdmin to manage my local MySQL databases. One thing I'm starting to pick up is the correct character sets for my database. I've decided UTF-8 is the best for compatibility (as my XHTML templates are served as UTF-8), but one thing that confuses me is the varied options for UTF-8 I'm presented with in the phpMyAdmin interface.
The two I've isolated are:
utf8_general_ci
utf8_unicode_ci
So my question is this: what is the difference between the general and unicode variants of utf8 in MySQL? (I've come to learn that ci is shorthand for case-insensitive)
I would be most grateful for any help in this matter.
From the MySQL manual on Unicode Character Sets:
For any Unicode character set, operations performed using the _general_ci collation are faster than those for the _unicode_ci collation. For example, comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci. The reason for this is that utf8_unicode_ci supports mappings such as expansions; that is, when one character compares as equal to combinations of other characters. For example, in German and some other languages “ß” is equal to “ss”. utf8_unicode_ci also supports contractions and ignorable characters. utf8_general_ci is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters.
See the referenced page for further information and examples.
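The "expansion" idea from the quoted passage can be sketched outside MySQL. The example below uses Python's str.casefold(), which is Unicode case folding rather than a collation, but it happens to include the same ß to ss expansion, so it shows the difference between a one-to-one comparison and one that supports expansions. Both function names are invented for this illustration.

```python
def equal_one_to_one(a: str, b: str) -> bool:
    """Compare after simple lowercasing, which leaves 'ß' as a single character."""
    return a.lower() == b.lower()


def equal_with_expansions(a: str, b: str) -> bool:
    """Compare after full case folding, which expands 'ß' to 'ss'."""
    return a.casefold() == b.casefold()


print(equal_one_to_one("Straße", "STRASSE"))       # the one-to-one comparison misses the match
print(equal_with_expansions("Straße", "STRASSE"))  # the expansion-aware comparison finds it
```

The extra work of mapping one character to several is part of why the manual describes the _unicode_ci comparisons as slower but more correct.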
The ##%!ing manual discusses this... :)
One of the issues is speed and accuracy of certain operations.

Should I migrate a MySQL database with a latin1_swedish_ci collation to utf-8 and, if so, how?

The MySQL database used by my Rails application currently has the default collation of latin1_swedish_ci. Since the default charset of Rails applications (including mine) is UTF-8, it seems sensible to me to use the utf8_general_ci collation in the database.
Is my thinking correct?
Assuming it is, what would be the best approach to migrate the collation and all the data in the database to the new encoding?
UTF-8, as well as any other Unicode encoding scheme, can store characters in any language, so it is an excellent choice of codepage for your database.
The collation setting, on the other hand, is a completely separate issue from the encoding scheme. It involves sort orders, upper/lowercase conversions, string equality comparisons, and things like that which are language-specific. The collation setting should match the language that is used in the database.
The UTF-8 general collation is (I am assuming here—I'm not familiar with MySQL in particular) used for situations where the language is unknown and some simple default ordering is needed. It probably corresponds to the Unicode code point ordering, which is almost certainly not what you want if you're storing Swedish.
Convert to UTF-8 as the charset.
Collation settings are only used for sorting and stuff like that. Choose the collation that most of your users would expect.
Provided your existing data in the database is CORRECTLY encoded in latin1, converting the tables to utf8 (using ALTER TABLE, as described in the docs) should just work.
Then all your application needs to do is continue doing whatever it did before. If your application wants to use unicode characters, it should set its connection encoding to utf8 and use utf8, but that's its own problem.
The problem is that a large number of crap web apps have historically sent utf8 data to mysql and told it to treat it as latin1. MySQL will honour this perfectly and save junk into the tables, as instructed.
Converting the tables from latin1 to utf8 will NOT repair this mistake, as you genuinely do have total rubbish in there. Repairing them is nontrivial, particularly if during the lifetime of the app it's been talking different types of rubbish to the database.
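For the simplest case of that mistake (UTF-8 bytes stored once as if they were latin1), the damage can be reversed by undoing the wrong decoding step. The sketch below shows the idea on a plain string. It assumes cp1252 as the stand-in for MySQL's latin1, and fix_mojibake is a hypothetical helper, not something MySQL or Rails provides. Data that went through several different kinds of mis-encoding will not be repaired by it.

```python
def fix_mojibake(text: str) -> str:
    """Undo one round of 'UTF-8 bytes read as cp1252'; return the input unchanged if that fails."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not produced by this particular mistake, so leave it alone.
        return text


print(fix_mojibake("EspaÃ±a"))  # damaged value
print(fix_mojibake("España"))   # already-correct value passes through untouched
```

Trying it on a copy of a few affected rows first is a cheap way to find out whether the data has this one problem or a mix of several.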
Use the MySQL query below to convert your column:
ALTER TABLE users MODIFY description VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci;
To see full details about your table:
SHOW FULL COLUMNS FROM users;