Accent insensitive search on a problematic database - mysql

I have a database that contains data in different languages. Some languages use accents (like áéíóú) and I need to search in this data as the accents doesn't exist (search for 'campeon' should return 'campeón' as a valir result).
The problem is that the tables in my database (utf8_unicode_ci) are not storing utf8 characteres. If you see the data through phpmyadmin the words with accents looks like this: campeón
After some researching, I've found (in a StackOverflow question) that the problem is related to the inexistence of a SET NAMES [charset]. In fact, I've made some testings and if I set names to utf8, everything works as expected.
Well, I have the solution, what's the problem? The problem is that the database is in production, so there are thousands of strings in the database. If I change the character set the client will use, all already existing string will become invalid. The question is: is there any way to:
perform accent-insensitive searches in a database that uses a wrong charset like mine?
transform safely the data in the tables to the appropriate charset?
continue working with mixed charsets (latin1 and utf8) in the database, assuming that latin1 data will not be accent-insensitive?
If anybody has experience in any of the solutions I propose or has a new one, I'll be very thankful if share.

The problem being that the data was inserted using the wrong connection encoding, you can fix it by
Exporting the data using the wrong connection encoding, just like you have used it thus far, followed by
Importing the data using the correct utf8 connection encoding.
That will fix the encoding problem, after which search will work as expected.

What if you create a copy of the table at the beginning of your session, alter the copy's charset, perform all your queries from that, and then drop the table at the end of your session? I don't know how practical this would be - depends on how often you need to perform these queries and how big the table is.

Related

MySQL encoding weird characters

As many people already had, I have a problem in MySQL with the encoding of my data.
More specifically, the collation of the table seems to be utf8_general_ci. The data inserted is inserted well, but when a select is done, some characters get translated badly:
Marie-Thérèse becomes Marie-Thérèse.
Is it possible to do a select and translate these characters back to the original value, or is it impossible? It's harder to change the original table in my case, so I'd rather solve it in my select query.
When using phpmyadmin (or the like) and looking at those entries, are those entries okay?
update: if not, the inserts are probably flawed already, and the connection from the insertion script must be adapted.
If so, then it's not technically MySQL's fault but the software connecting to it. See for example: UTF-8 all the way through . You have to set some parameters on/after opening the connection.
btw: The collation should be irrelevant. http://dev.mysql.com/doc/refman/5.7/en/charset-general.html
The gist is: a collation tells the you, how you have to order/compare strings, which is mainly important for special characters like äöü in German or àéô in French/... because their local/regional collation say, ä is - for ordering purposes - exactly like a (for example), in another collation, ä could be distinctly after a or even after z.
In the end it seems like the problem was with running it all through a cronbjob.
We run a script through a cronjob that generates the insert statements. Apparently, when running the script manually, everything goes well, but when running the same script through a cronjob, the data got messed up. We solved that by this article:http://www.logikdev.com/2010/02/02/locale-settings-for-your-cron-job/
We had to add a variable LANG in the etc/environment file.

How to sort right to left language data in mysql?

I have a MySQL database that is storing Persian data and information.
the information are names and I want to sort names by alphabet. but MySQL don't know Persian language, and some other right to left languages.
How can I sort them?
and my other problem is with phpmyAdmin, phpmyAdmin can't show Persian language data and show some character instead of that
About the first question; as fancyPants said, use the proper collation and you should be fine. Sorting is handled by collations and there is a utf8 Persian collation available.
About your second problem:
Almost certainly what is happening is that you're improperly storing the data. As Sid M said, knowing what you've tried and how your system is running would be a big help, but these questions almost always end up being misconfigured or poorly written software. phpMyAdmin and MySQL can deal just fine with multiple character sets. Presumably, you'll want to use utf8.
Set up your database and tables properly, then make sure your client application is configured properly (likely using SET NAMES 'UTF8' or mysql_set_charset('utf8'), but read the links for more detail than is worth including here).
See https://wiki.phpmyadmin.net/pma/Garbled_data and How to display UTF-8 characters in phpMyAdmin? for starters and SQL injection that gets around mysql_real_escape_string() for way more information than you probably wanted to learn :)

migrating mysql DBs from one host to another, encoding issues

I am migrating a very large number of mysql DBs from a few shared web hosts to one shared web host.
The majority of these are Portuguese, so there's quite a few special characters. Some of the DBs which I am migrating are in latin1, some are cp1251, some are utf8.
Of course, simply dumping the DBs, and then restoring the dumps onto the new host completely botches the encoding and "?" characters and other nonsense shows up in the actual websites associated with the databases.
On a small scale, it would be acceptable to muck about with the html charset tags, to know what to dump/restore as, but the problem is that we're dealing with thousands databases and websites, and the migrations are all done automatically via several scripts.
I'm looking for suggestions on the best way of dumping/restoring these DBs assuming that the script doing the work will not know the encoding which is specified in the HTML tags.
So far, I have tried using the actual mysqldump tool, as well as mimicking it with a php script, and dumping to and from memory instead of to and from a text file, neither of these seem to replicate the data perfectly from one to the other without encoding issues.
Should I be using UTF8 to encode the dump, then restoring as is regardless of the html codepage?
Dumping and restoring both in UTF8 regardless of HTML codepage?
Dumping and restoring in the default charset found in each create table statement?
My understanding of the implications and effects of these different scenarios is limited, but what I need to know is basically if there is a way to perfectly replicate data without encoding issues between 2 database servers without knowing the codepage used by the HTML of the script which is accessing the data.
Encodings are a very difficult problem to tackle, especially when moving databases. Try first to do a structural import, and then compare exactly the new structure with the old one, taking special care in database character set, table default character set and columns character sets. You can get these informations very easily from the information_schema database.
Once those are absolutelly mirorred, you can begin the import. However, beware of the fact that you can hold characters in differend encoding types in differend encoded columns (it is quite common to have utf8 valid characters in a latin1 column, latin 1 is a 1 byte character set, while utf8 can have characters of up to 3 bytes).
You can try various methods after this to convert the dumps but as far as i know so far there is not a 100% valid method to convert this type of cases of mixed encoding types in same column. Eventually you might need to do some manual cleanup. But hopefully the first approach will suffice, and everything will be fine.

Databases: column encoding, when is it important?

We are importing data from .sql script containing UTF-8 encoded data to MySQL database:
mysql ... database_name < script.sql
Later this data is being displayed on page in our web application (connected to that database), again in UTF-8. But somewhere in the process something went wrong, because non-ascii characters was displayed incorrectly.
Our first attempt to solve it was to change mysql columns encoding to UTF-8 (as described for example here):
alter table wp_posts change post_content post_content LONGBLOB;`
alter table wp_posts change post_content post_content LONGTEXT CHARACTER SET utf8;
But it didn't helped.
Finally we solved this problem by importing data from .sql script with additional command line flag which as I believe forced mysql client to treat data from .sql script as UTF-8.
mysql ... --default-character-set=utf8 database_name < script.sql
It helped but then we realized that this time we forgot to change column encoding to utf8 - it was set to latin1 even if utf-8 encoded data was flowing through database (from sql script to application).
So if data obtained from database is displayed correctly even if database character set is set incorrectly, then why the heck should I bother setting correct database encoding?
Especially I would like to know:
What parts of database rely on column encoding setting? When this setting has any real meaning?
On what occasions implicit conversion of column encoding is done?
How does trick with converting column to binary format and then to the destination encoding work (see: sql code snippet above)? I still don't get it.
Hope someone help me to clear things up...
The biggest reason, in my view, is that it breaks your DB consistency.
it happens way to often that you need to check data in the database. And if you cannot properly input UTF-8 strings coming from the web page to your MySQL CLI client, it's a pity;
if you need to use phpMyAdmin to administer your database through the “correct” web, then you're limiting yourself (might not be an issue though);
if you need to build a report on your data, then you're greatly limited by the number of possible choices, given only web is producing your the correct output;
if you need to deliver a partial database extract to your partner or external company for analysis, and extract is messed up — it's a pity.
Now to your questions:
When you ask database to ORDER BY some column of string data type, then sorting rules takes into account the encoding of your column, as some internal trasformation are applicable in case you have different encodings for different columns. Same applies if you're trying to compare strings, encoding information is essential here. Encoding comes together with collation, although most people don't use this feature so often.
As mentioned, if you have any set of columns in different encodings, database will choose to implicitly convert values to a common encoding, which is UTF8 nowadays. Strings' implicit encoding might be done in the client frameworks/libraries, depending on the client's environment encoding. Typically data is recoded into the database's encoding when sent to the server and back into client's encoding when results are delivered.
Binary data has no notion of encoding, it's just a set of bytes. So when you convert to binary, you're telling database to “forget” encoding, although you keep data without changes. Later, you convert to the string enforcing the right encoding. This trick helps if you're sure that data physically is in UTF-8, while by some accident a different encoding was specified.
Given that you've managed to load in data into the database by using --default-character-set=utf8 then there was something to do with your environment, I suggest it was not UTF8 setup.
I think the best practice today would be to:
have all your environments being UTF8 ready, including shells;
have all your databases defaulting to UTF8 encoding.
This way you'll have less field for errors.

MySQL update error when special characters are used

I was wondering if anyone had come across this one before. I have a customer who uses special characters in their product description field. Updating to a MySQL database works fine if we use their HTML equivalents but it fails if the character itself is used (copied from either character map or Word I would assume).
Has anyone seen this behaviour before? The character in question in this case is ø - and we can't seem to do a replace on it (in ASP at least) as the character comes though to the SQL string as a "?".
Any suggestions much appreciated - thanks!
This suggests a mismatched character set between your database (connection) and actual data.
Most likely, you're using ISO-8859-1 on your site, but MySQL thinks it should be getting UTF-8.
http://dev.mysql.com/doc/refman/5.0/en/charset-connection.html describes what to check and how to change it. The simplest way is probably to run the query "SET NAMES latin1" when connecting to the database (assuming that's the character set you need).
Being a fan of Unicode, I'd suggest switching over to UTF-8 entirely, but I realize that this is not always a feasible option.
Edit: #markokocic: Collation only dictates the sorting order. Although this should of course match your character set, it does not affect the range of characters that can be stored in a field.
Have you tried to set collation for the table to utf-8 or something non latin1/ascii.