How to convert mysql latin1 to utf8 - mysql

I inherited a web system that I need to develop further.
The system seems to be created by someone who read two chapters of a PHP tutorial and thought he could code...
So... the webpage itself is in UTF8 and displays and inputs everything in it. The database tables have been created with UTF8 character set. But, in the config, there is "SET NAMES LATIN1". In other words, UTF8 encoded strings are populated into the database with forced latin1 coding.
Is there a way to convert this mess to actually store in utf8 and get rid of latin1?
I tried this, but since the database table is set to utf8, this does not work. Also tried this one without success.
I maybe able to do this by reading all tables in PHP with latin1 encoding then write them back to a new database in utf8, but I want to avoid it if possible.

I managed to solve it by running updates on text fields like this:
UPDATE table SET title = CONVERT(CONVERT(CONVERT(title USING latin1) USING binary) USING UTF8)

The situation isn't as bad as you think it is, unless you already have lots of non-Roman characters (that is, characters that aren't representable in Latin-1) in your database already. Latin-1 is a proper subset of utf8. Your web app works in utf8 and your tables' contents are in utf8 as well. So there's no need to convert the tables.
So, try changing the SET NAMES latin1 to SET NAMES utf8. It will probably solve your problem, by allowing your php program's connection to work with the same character set as the code on either end of the connection.
Read this. http://dev.mysql.com/doc/refman/5.7/en/charset-connection.html

Change column data type
to VARBINARY
and it's automatically convert the latin1 datas
Thank you guys.
I hope it's helpful.

Related

MySQL character set with issue

I have an issue with the character set for my MySQL database. I am trying to support all languages so I have set the collation to utf8_general_ci and the character set variables to utf8.
The columns of the tables (and the tables themselves) are also set to utf8...
This seems to work fine on my local database...
...but not on the live database...
The live database is hosted on AWS RDS and I have set the character set and collation parameters to utf8/utf8_general_ci as above and rebooted.
This is both an issue in Workbench as well as when the data is queried from code. The value has not been saved properly to the database like it has been locally.
Is there something I'm missing?
Something is set up differently. See "best practice" and "question mark" in Trouble with UTF-8 characters; what I see is not what I stored
If you need "all" languages, you should use CHARACTER SET utf8mb4 so that you can handle the 4-byte characters in Chinese.

The dreaded � character and displaying info from database in UTF8

So, I have a database and I use Navicat. We have a simple PHP website which is a few years old and we've upgraded the site to UTF8.
We have 'activities' on the site which handle UTF8 special characters perfectly, but we also have 'comments' on the site and curly single quotes and other special characters show me a �.
The database was converted to UTF via:
ALTER TABLE tablename CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
When I look at both databases in Navicat, I can see both are UTF8 and utf8_general_ci.
When I design the table I can see the 'activities' table I can see the cell is a mediumText and is setup with UTF8. When I design the 'comments' section, the cell that isn't working is a Blob and it doesn't have any character encoding info.
We're doing a pretty basic SELECT and then displaying via $vairable[column].
Does anyone know why the 'activities' would work perfectly with UTF8 and the 'comments' would have issues? We're not doing anything super fancy to either of them.
I have tried converting the Blob to a text field, but when I do that the database then escapes it'self when it's outputting to the page, so as soon as there is a single quote in the text it cuts off.
I have tried things like utf8_encode, stripslashes, mysql_real_escape_string, htmlentities, htmlspecialchars, but I'm not sure any of them would help anyway.
Thanks!
blob means binary large object. Binary data does not have any encoding in raw.
So you have latin1 or whatever data in a blob, and you show it and treat it like utf-8 data.
You need to manually convert the data using PHP or whatever.
Here is a good article from the performanceblog that describes what you can do:
http://www.mysqlperformanceblog.com/2013/10/16/utf8-data-on-latin1-tables-converting-to-utf8-without-downtime-or-double-encoding/
If you have problems firing your queries, use the console instead of phpMyAdmin and don't forget the connection encoding through SET NAMES
master> ALTER TABLE t CONVERT TO CHARACTER SET utf8, CHANGE comment comment TEXT;
master> SET NAMES utf8;

Converting data from LATIN1 to UTF8 in MySQL

I'm trying migrate a MSSQL database to MySQL. Using MySQL Workbench I moved the schema and data over but having problems converting the character encoding. During the migration I had the tool put text into BLOBS when there was problems with the encoding.
I believe I've confirmed that the data that is now in MySQL is *latin1_swedish_ci*. To simplify the problem I'm looking at ® symbols in one of the columns.
I wanted to convert the BLOBS to VARCHAR or TEXT with UTF8 encoding. I'm running this SQL command on one of the columns:
ALTER TABLEbookdetailsMODIFYBookNameVARCHAR(50) CHARACTER SET utf8 COLLATE utf8_unicode_ci;
Instead of converting the ® it is just removing them which is not what I want. What am I doing wrong? Not that reading half the internet trying to find a solution isn't fun but 3 days in and I think my eyes are about to give out.
MySQL workbench has a UI that is relatively simple to nav. If you need to change the collation of the tables or schemas, you can right click them on the Object Browser and go to alter table, or alter schema there you can change the data types, and set the collation to whatever you want.

How to detect UTF-8 characters in a Latin1 encoded column - MySQL

I am about to undertake the tedious and gotcha-laden task of converting a database from Latin1 to UTF-8.
At this point I simply want to check what sort of data I have stored in my tables, as that will determine what approach I should use to convert the data.
Specifically, I want to check if I have UTF-8 characters in the Latin1 columns, what would be the best way to do this? If only a few rows are affected, then I can just fix this manually.
Option 1. Perform a MySQL dump and use Perl to search for UTF-8 characters?
Option 2. Use MySQL CHAR_LENGTH to find rows with multi-byte characters?
e.g. SELECT name FROM clients WHERE LENGTH(name) != CHAR_LENGTH(name);
Is this enough?
At the moment I have switched my Mysql client encoding to UTF-8.
Character encoding, like time zones, is a constant source of problems.
What you can do is look for any "high-ASCII" characters as these are either LATIN1 accented characters or symbols, or the first of a UTF-8 multi-byte character. Telling the difference isn't going to be easy unless you cheat a bit.
To figure out what encoding is correct, you just SELECT two different versions and compare visually. Here's an example:
SELECT CONVERT(CONVERT(name USING BINARY) USING latin1) AS latin1,
CONVERT(CONVERT(name USING BINARY) USING utf8) AS utf8
FROM users
WHERE CONVERT(name USING BINARY) RLIKE CONCAT('[', UNHEX('80'), '-', UNHEX('FF'), ']')
This is made unusually complicated because the MySQL regexp engine seems to ignore things like \x80 and makes it necessary to use the UNHEX() method instead.
This produces results like this:
latin1 utf8
----------------------------------------
Björn Björn
Since your question is not completely clear, let's assume some scenarios:
Hitherto wrong connection: You've been connecting to your database incorrectly using the latin1 encoding, but have stored UTF-8 data in the database (the encoding of the column is irrelevant in this case). This is the case I described here. In this case, it's easy to fix: Dump the database contents to a file through a latin1 connection. This will translate the incorrectly stored data into incorrectly correctly stored UTF-8, the way it has worked so far (read the aforelinked article for the gory details). You can then reimport the data into the database through a correctly set utf8 connection, and it will be stored as it should be.
Hitherto wrong column encoding: UTF-8 data was inserted into a latin1 column through a utf8 connection. In that case forget it, the data is gone. Any non-latin1 character should be replaced by a ?.
Hitherto everything fine, henceforth added support for UTF-8: You have Latin-1 data correctly stored in a latin1 column, inserted through a latin1 connection, but want to expand that to also allow UTF-8 data. In that case just change the column encoding to utf8. MySQL will convert the existing data for you. Then just make sure your database connection is set to utf8 when you insert UTF-8 data.
There is a script on github to help with this sort of a thing.
I would create a dump of the database and grep for all valid UTF8 sequences. Where to take it from there depends on what you get. There are multiple questions on SO about identifying invalid UTF8; you can basically just reverse the logic.
Edit: So basically, any field consisting entirely of 7-bit ASCII is safe, and any field containing an invalid UTF-8 sequence can be assumed to be Latin-1. The remaining data should be inspected - if you are lucky, a handful of obvious substitutions will fix the absolute majority (replace ö with Latin-1 ö, etc).

mySQL Character Sets

I noticed today that our database uses character set "utf8 -- UTF-8 Unicode" and collation "utf8_general_ci" but most of the tables and columns inside are using CHARSET=latin1. Will I run into any problems with this?
The reason I ask is because we have been running into a lot of problems syncing data between two database.
For an overview of MySQL character sets, read for example http://mysqldump.azundris.com/archives/60-Handling-character-sets.html
The server, a schema/database and a table have no character sets, they have just defaults that are inherited downwards (server to schema to table). Columns that are of a CHAR, VARCHAR or any TEXT type have character sets, and do so on a per column basis. If no specific character set is defined for them, they inherit from the table.
Inheritance for all these objects happens at object creation time.
The other thing that has a character set is the connection. Since the connection is the collection of things the server knows about the client, the character set of the connection should be set to whatever character set you are using in your client.
MySQL will then correctly convert between the character set of a column and the character set of a connection. Usually there are no problems with that.
The most common problem PEOPLE have with it is lying to the server, that is, setting the character set of a connection to something different from what the client is actually sending or using. This can be done at runtime by sending the command SET NAMES ... as the first thing at connection setup, and it is very important that you specify the correct thing here.
If you do, and for example send latin1 data into a connection that has been SET NAMES latin1, storing data into a latin1 column will not convert data, whereas storing data into a utf8 column will convert your latin1 umlauts (ö = F6) into utf8 umlauts (ö = C3 B6) on disk. Reading will transparently convert back, if the connection is properly set up.
In your setup, if your connection is SET NAMES utf8 and you are sending data to a latin1 column, only data that can be represented in latin1 can be stored. There will be data truncation, and a data truncation warning if you for example try to store japanese hiragana in such a latin1 column.
My experience with messign up MySQL charset was not 100% functional sorting of strings. You would be better with having everything in UTF-8 to be on the safe side.
I think it depends on what you actually store in that columns. If you store UTF-8 multi-byte characters in a column with latin-1 charset you might run into the sorting troubles. But as longs as there are only EN/US characters you should be ok.
You will run into problems if there's a possibility of storing "international" text -- that is, non-latin characters.
If I understand what you 're posting correctly, this means that the default for new tables in your database is UTF-8, but your existing tables use latin-1. That could be a problem. Depends on your data, as mentioned above.