The dreaded � character and displaying info from database in UTF8 - mysql

So, I have a database and I use Navicat. We have a simple PHP website which is a few years old and we've upgraded the site to UTF8.
We have 'activities' on the site which handle UTF8 special characters perfectly, but we also have 'comments' on the site and curly single quotes and other special characters show me a �.
The database was converted to UTF via:
ALTER TABLE tablename CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
When I look at both databases in Navicat, I can see both are UTF8 and utf8_general_ci.
When I design the table I can see the 'activities' table I can see the cell is a mediumText and is setup with UTF8. When I design the 'comments' section, the cell that isn't working is a Blob and it doesn't have any character encoding info.
We're doing a pretty basic SELECT and then displaying via $vairable[column].
Does anyone know why the 'activities' would work perfectly with UTF8 and the 'comments' would have issues? We're not doing anything super fancy to either of them.
I have tried converting the Blob to a text field, but when I do that the database then escapes it'self when it's outputting to the page, so as soon as there is a single quote in the text it cuts off.
I have tried things like utf8_encode, stripslashes, mysql_real_escape_string, htmlentities, htmlspecialchars, but I'm not sure any of them would help anyway.
Thanks!

blob means binary large object. Binary data does not have any encoding in raw.
So you have latin1 or whatever data in a blob, and you show it and treat it like utf-8 data.
You need to manually convert the data using PHP or whatever.
Here is a good article from the performanceblog that describes what you can do:
http://www.mysqlperformanceblog.com/2013/10/16/utf8-data-on-latin1-tables-converting-to-utf8-without-downtime-or-double-encoding/
If you have problems firing your queries, use the console instead of phpMyAdmin and don't forget the connection encoding through SET NAMES
master> ALTER TABLE t CONVERT TO CHARACTER SET utf8, CHANGE comment comment TEXT;
master> SET NAMES utf8;

Related

How can I determine utf8 data encoding error and correct it in MySql?

I have a website form written in Perl that saves user input in multiple languages to a MySQL database. While it has worked perfectly saving and displaying all characters without problems, in PHPMyAdmin the characters always displayed with errors. However I ignored this since the website was displaying characters OK.
Now I've just recently moved the website to a VPS and the database has seemingly enforced ut8mb4 encoding on the data, so it is now displaying character errors on the site. I'm not an expert and find the whole encoding area quite confusing. My question is, how can I:
a) determine how my data is actually encoded in my table?
b) convert it correctly to utf8mb4 so it displays correctly in PHPMyAdmin and my website?
All HTML pages use the charset=utf8 declaration. MySQL connection uses mysql_enable_utf8 => 1. The table in my original database was set to utf8_general_ci collation. The original database collation (I just noticed) was set to latin1_swedish_ci. The new database AND table collation is utf8mb4_general_ci. Thanks in advance.
SHOW CREATE TABLE will tell you the default CHARACTER SET for the table. For any column(s) that overrode the default, the column will specify what it is set to.
However, there could be garbage in the column. Many users have encountered this problem when they stored utf8 bytes into a latin1 column. This lead to "Mojobake" or "double encoding".
The only way to tell what is actually stored there is to SELECT HEX(col). Western European accented characters will be
one byte for a latin1 character stored in latin1 column.
2 bytes for a utf8 character stored in 1 utf8 character or into 2 latin1 characters.
several bytes for "double encoding" when converted twice.
More discussion: Trouble with UTF-8 characters; what I see is not what I stored

How to convert mysql latin1 to utf8

I inherited a web system that I need to develop further.
The system seems to be created by someone who read two chapters of a PHP tutorial and thought he could code...
So... the webpage itself is in UTF8 and displays and inputs everything in it. The database tables have been created with UTF8 character set. But, in the config, there is "SET NAMES LATIN1". In other words, UTF8 encoded strings are populated into the database with forced latin1 coding.
Is there a way to convert this mess to actually store in utf8 and get rid of latin1?
I tried this, but since the database table is set to utf8, this does not work. Also tried this one without success.
I maybe able to do this by reading all tables in PHP with latin1 encoding then write them back to a new database in utf8, but I want to avoid it if possible.
I managed to solve it by running updates on text fields like this:
UPDATE table SET title = CONVERT(CONVERT(CONVERT(title USING latin1) USING binary) USING UTF8)
The situation isn't as bad as you think it is, unless you already have lots of non-Roman characters (that is, characters that aren't representable in Latin-1) in your database already. Latin-1 is a proper subset of utf8. Your web app works in utf8 and your tables' contents are in utf8 as well. So there's no need to convert the tables.
So, try changing the SET NAMES latin1 to SET NAMES utf8. It will probably solve your problem, by allowing your php program's connection to work with the same character set as the code on either end of the connection.
Read this. http://dev.mysql.com/doc/refman/5.7/en/charset-connection.html
Change column data type
to VARBINARY
and it's automatically convert the latin1 datas
Thank you guys.
I hope it's helpful.

Is this a safe way to convert MySQL tables from latin1 to utf-8?

I need to change all the tables in one of my databases from latin1 to utf-8 (with utf8_bin collation).
I have dumped the database, created a test database from it, and run the following without any errors or warnings for each table:
ALTER TABLE tablename CONVERT TO CHARSET utf8 COLLATION utf8_bin
Is it safe for me to repeat this on the real database? The data seems fine by inspection...
There are 3 different cases to consider:
The values are indeed encoded using Latin1
This is the consistent case: declared charset and content encoding match. This was the only case I covered in my initial answer.
Use the command you suggested:
ALTER TABLE tablename CONVERT TO CHARSET utf8 COLLATE utf8_bin
Note that the CONVERT TO CHARACTER SET command only appeared in MySQL 4.1.2, so anyone using a database installed before 2005 had to use an export/import trick. This is why there are so many legacy scripts and document on Internet doing it the old way.
The values are already encoded using utf8
In this case, you don't want mysql to convert any data, you only need to change the column's metadata.
For this, you have to change the type to BLOB first, then to TEXT utf8 for each column, so that there are no value conversions:
ALTER TABLE t1 CHANGE c1 c1 BLOB;
ALTER TABLE t1 CHANGE c1 c1 TEXT CHARACTER SET utf8
This is the recommended way, and it is explicitely documented in Alter Table Syntax Documentation.
The values use in a different encoding
The default encoding was Latin1 for several years on a some Linux distributions. In this case, you have to use a combination of the two techniques:
Fix the table meta-data, using the BLOB type trick
Convert the values using CONVERT TO.
A straightforward conversion will potentially break any strings with non-utf7 characters.
If you don't have any of those (i.e. all of your text is english), you'll usually be fine.
If you've any of those, however, you need to convert all char/varchar/text fields to blob in an initial run, and to convert them to utf8 in a subsequent run.
See this article for detailed procedures:
http://codex.wordpress.org/Converting_Database_Character_Sets
I've done this a few times on production databases in the past (converting from the old standard encoding swedish to latin1), and when MySQL encounters a character that cannot be translated to the target encoding, it aborts the conversion and remains in the unchanged state. Therefor, I'd deem the ALTER TABLE statement working.

Internationalization SQL/PHP/HTML

I am attempting to learn about internalization in websites, so I am tampering with MySQL config file, field collations and html header character set type.
I basically have a form where I type some unicode characters in a text field, store it in a database, then output it back to the browser.
First scenario: HTML=>utf8 MySQL=>UTF8, that worked OK. However, when I viewed the database from PhPMyAdmin, weird characters were resident in the field.
Second scenario: I configured the VARCHAR in the database to be Latin1 by choosing a swedish_ci collation. The HTML remained utf8. I entered a unicode string in the form. Yet, the browser still displays the correct characters that I entered!!!
To make things more complicated for me to comprehend, I downloaded the mysql world database which is a database of all countries and cities of the world. The tables are Latin1 encoded. When I try to display them in a utf8 html, it displays weird characters for non-English characters. It works OK when my html character set is ISO-8591-1
If you are using utf-8 you should allways use utf-8 for HTML, utf8 charset in MySQL and utf8_general_ci or utf8_unicode_ci for MySQL collation. And your php document must be saved as utf-8 without BOM.

Latin1_swedish_ci Encoding in MySQL

I am using UTF-8 encoding on my website. Lately I have been storing chinese/spanish/russian names in my MySQL tables and then printing them with PHP on a page generated with a charset of UTF-8. The page works fine and I see all the letters correctly. However, I just realized that my table is set with latin1_swedish_ci charset. How is it possible that even though I stored these names with latin1_swedish_ci charset, serving them on my site with UTF-8 still shows them up correctly?
Thanks!
Joel
Because mysql connection is still using latin1,
you should treat these data is in UTF-8 but store in latin1 environment.
So, to prove it,
show variables like '%char%';
the above should return most of the setting is in latin1
apply
set names utf8;
And you would see all the UTF-8 become double encoded (garbled)
You would be interested in thread:
A script to change all tables and fields to the utf-8-bin collation in MYSQL
They are dealing there with ways of repairing same things you have done.
Note that when used in primary keys, size of varchar encoded in utf8 count triple, so the maximal length for single primary key column is varchar(333).