Is it safe to update tables from latin1 to utf8 - mysql

Whilst doing some checking for a client to see if their site was still functioning well I found a random page that contained a bunch of weird characters like ¿½.
I think this has to do with the tables having a latin1 encoding instead of utf-8. But since no other pages that use the same table are affected, could there be another cause? I did check that the text itself was safe and made sure it was just clean text.
So I have two questions. The main one: is it safe to just update this one table to utf8? And if not, what causes this error, and why would it affect only one particular page?
(Side note: the website is built using TYPO3.)
Of course I have live examples; the links are:
Site 1: With weird text characters
Site 2: Same table, but no weird characters

Ultimately the client connecting to the database decides how their encodings are handled; that's known as the connection encoding. Whatever encoding the text is stored as in MySQL, it will be converted on the fly to/from the client's connection encoding. As such, just changing the underlying column's storage to utf8 doesn't affect anything.
However, that in itself also won't "fix" anything. The characters will still be garbage. You'll also have to convert the actual characters to the correct data. Otherwise you'll just have "¿½" stored encoded as utf8 instead of "¿½" stored encoded as latin1. And changing those characters will likely affect any other client which has been doing it wrong so far, so the client side needs to be fixed at the same time.
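To illustrate why changing the column charset alone leaves the garbage in place, here is a minimal Python sketch. It only simulates the byte handling; no MySQL server is involved, and the sample string "café" is an assumption chosen for illustration, not data from the site in question.

```python
# Hypothetical sample text; any non-ASCII string behaves the same way.
original = "café"

# A client sends UTF-8 bytes over a connection declared as latin1,
# so each byte is taken for one latin1 character.
stored_chars = original.encode("utf-8").decode("latin-1")

# Converting the column to utf8 re-encodes those same (wrong) characters.
after_convert = stored_chars.encode("utf-8").decode("utf-8")

# Repairing the data means undoing the wrong interpretation.
repaired = after_convert.encode("latin-1").decode("utf-8")

print(stored_chars)   # cafÃ©
print(after_convert)  # cafÃ©
print(repaired)       # café
```

The conversion step is lossless, which is exactly the problem: it faithfully preserves the wrong characters. Only the last step changes what a correctly configured client would read.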

Related

Storing swedish characters in mysql database

I'm having problems storing Swedish characters in my MySQL database. I want to store them in my table called users with the collation utf8_bin. Even though I'm using utf8, the characters å ä ö get stored as Ã¥ Ã¤ Ã¶ and I don't know why. Retrieving the data and echoing it gives me the same output, with the weird characters instead of å ä ö. Any help is appreciated.
Call
mysql_set_charset("utf8");
After connecting and before making any queries.
Your database charset is just for storage, not for transmission between app and database.
There are several places where you have to pay attention to the encoding.
Database: you already use a utf8 collation, so that's fine
Database connection: use mysqli_set_charset to set the charset of the connection, if you're using mysqli. Other database drivers have similar functions.
Output encoding of the page: You can use HTTP headers or meta tags. If you want to be on the safe side, specify both.
You should make sure that the database connection uses the same encoding as the stored data (utf8 here) and that the encoding of the page output is correct as well. Mismatched encodings cause many of these problems. Read more about character encodings here.
There are several things to consider here. For this to work well now and in the future, ALL interactions with the text have to use the same encoding, even within the db (for joins to work well, etc.).
The encoding of the data being inserted (set it in the header of the page and/or UTF-8 encode the data when inserting).
The encoding of the db tables (I would recommend utf8_swedish_ci for all).
The encoding of the page displaying results from the db (set this in the header).
The encoding of the page being edited. It's possible to open documents in different encodings. This is a big issue if you are not familiar with it. Open and save documents in the right encoding.
There used to be a problem with the connection encoding too. Set this correctly, but today it is a smaller problem than a couple of years ago, because of changes.
A couple of notes: are you sure your data is stored like that, or is it just presented wrong, for instance via phpMyAdmin? Try printing with print utf8_encode($text)
Or try the utf8_decode() function; that gives you some insight...
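One way to tell "stored wrong" from "presented wrong" is to look at the raw bytes instead of the rendered text (in MySQL, something like SELECT HEX(column) serves the same purpose). The Python sketch below shows what the single character å looks like in three situations; the helper name hex_bytes is made up for this example.

```python
def hex_bytes(data: bytes) -> str:
    """Render bytes as space-separated hex pairs for easy comparison."""
    return " ".join(f"{b:02x}" for b in data)

char = "å"

as_latin1 = char.encode("latin-1")   # correct latin1 storage
as_utf8 = char.encode("utf-8")       # correct utf8 storage
# UTF-8 bytes misread as latin1 and then encoded as UTF-8 again
double_encoded = as_utf8.decode("latin-1").encode("utf-8")

print(hex_bytes(as_latin1))       # e5
print(hex_bytes(as_utf8))         # c3 a5
print(hex_bytes(double_encoded))  # c3 83 c2 a5
```

If a utf8 column holds c3 a5 for å, the data is fine and the problem is in the connection or the page output. If it holds c3 83 c2 a5, the data itself was double-encoded on the way in.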

migrating mysql DBs from one host to another, encoding issues

I am migrating a very large number of mysql DBs from a few shared web hosts to one shared web host.
The majority of these are Portuguese, so there's quite a few special characters. Some of the DBs which I am migrating are in latin1, some are cp1251, some are utf8.
Of course, simply dumping the DBs, and then restoring the dumps onto the new host completely botches the encoding and "?" characters and other nonsense shows up in the actual websites associated with the databases.
On a small scale, it would be acceptable to muck about with the HTML charset tags to know what to dump/restore as, but we're dealing with thousands of databases and websites, and the migrations are all done automatically via several scripts.
I'm looking for suggestions on the best way of dumping/restoring these DBs assuming that the script doing the work will not know the encoding which is specified in the HTML tags.
So far, I have tried using the actual mysqldump tool, as well as mimicking it with a PHP script and dumping to and from memory instead of to and from a text file. Neither approach seems to replicate the data perfectly from one host to the other without encoding issues.
Should I be using UTF8 to encode the dump, then restoring as is regardless of the html codepage?
Dumping and restoring both in UTF8 regardless of HTML codepage?
Dumping and restoring in the default charset found in each create table statement?
My understanding of the implications and effects of these different scenarios is limited, but what I need to know is basically if there is a way to perfectly replicate data without encoding issues between 2 database servers without knowing the codepage used by the HTML of the script which is accessing the data.
Encodings are a very difficult problem to tackle, especially when moving databases. First try a structure-only import, and then compare the new structure exactly with the old one, taking special care with the database character set, the table default character sets and the column character sets. You can get this information very easily from the information_schema database.
Once those are absolutely mirrored, you can begin the import. However, beware that a column declared with one encoding can hold characters in another (it is quite common to have valid utf8 characters in a latin1 column; latin1 is a 1-byte character set, while utf8 characters can take up to 3 bytes).
You can try various methods after this to convert the dumps, but as far as I know there is no 100% reliable method for converting this kind of mixed encoding within the same column. Eventually you might need to do some manual cleanup. But hopefully the first approach will suffice, and everything will be fine.
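As a rough illustration of one such conversion method for a column with mixed content, the Python sketch below keeps a value as-is when its bytes are already valid UTF-8 and otherwise decodes it with a fallback charset. The function name to_utf8, the latin-1 fallback and the sample rows are all assumptions made for this example; it is a heuristic, not a guaranteed fix.

```python
def to_utf8(raw: bytes, fallback: str = "latin-1") -> bytes:
    """Return raw unchanged if it is valid UTF-8, else transcode from fallback."""
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        return raw.decode(fallback).encode("utf-8")

# Hypothetical rows from one column: UTF-8, latin1 and plain ASCII mixed together.
rows = [
    "ação".encode("utf-8"),
    "ação".encode("latin-1"),
    b"plain",
]

cleaned = [to_utf8(row) for row in rows]
for value in cleaned:
    print(value.decode("utf-8"))
```

This is not 100% reliable, in line with the caveat above: some latin1 byte sequences happen to be valid UTF-8 as well, so a value can be misclassified and may still need manual cleanup.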

Which UTF8 - phpMyAdmin display Chinese, Russian, Arabic correctly?

I need to input content in 10 languages in MYSQL database (problem ones are: Chinese, Russian, Arabic) and client should be able to read and edit them through phpMyAdmin as well as admin area.
I have used utf8_bin, utf8_unicode_ci and utf8_general_ci, but the characters do not show properly in phpMyAdmin. In addition, I need to consider the search and sort problems, and as I can't understand the above languages I am worried that some characters might be escaped or mapped incorrectly.
Which UTF8 is the best in this case?
Is it normal for phpMyAdmin to display characters as '動力å“牌çµåˆå“牌與科技'?
How can I make phpMyAdmin display the content in human readable way?
There is nothing wrong with your database, it seems (unless the database's content is also UTF-8 mojibake, being double-mojibaked on the way to your browser). The output example you included looks like your browser is interpreting the phpMyAdmin page in the wrong encoding, most likely some ISO-8859 variant. Check and make sure that your browser's encoding is UTF-8.
The different collations specify different rules for sorting and searching, but the encoding is still the same. If you are storing multiple languages in the database, use utf8_general_ci.

WordPress encoding problem

I'm having what seems to be a problem related to WordPress, though it could be something else.
Here's what's happening:
I have a blog with posts using utf-8 characters (simple ones like ’). The characters all display correctly currently, however I'm moving my site to another server and seeing problems with all the utf-8 chars (’ becomes â€™).
I first thought the problem was with MySQL, but after looking into it it seems not to be the case. I created the new database by doing a synch with Navicat, and have confirmed that both db's and all tables are utf-8. When viewing the data in either db in any SQL program I've tried (Sequel Pro, Navicat) the chars show up unencoded (’). I've tried various synching methods, including ones that others have said solved encoding problems, but they did not work for me.
What confirmed it for me was setting up a test PHP script which pulled a single post_content field from each database. In the test script the chars show up encoded (â€™) regardless of which db they come from.
I checked the apache config file and found that HTTP_ACCEPT_CHARSET is set to the same (ISO-8859-1,utf-8;q=0.7,*;q=0.7) on both systems.
Soooo, I'm left thinking that it's a WordPress issue, though of course I could be wrong.
Any help would be truly appreciated, I’ve been banging my head on this for awhile now ;)
Thanks.
What you are seeing is UTF-8 data being interpreted as if it were ISO-8859-1 (or Win-1252, or another single-byte encoding). Problems like this are almost always a mismatch between the headers being sent to the browser and the actual encoding. Something is telling the browser that the stream is ISO-8859-1 while actually sending UTF-8.
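The mismatch can be reproduced in a few lines of Python. This is only a sketch of the decoding step: the function browser_view and the sample text are invented for illustration, and cp1252 stands in for ISO-8859-1 because browsers treat the ISO-8859-1 label as Windows-1252.

```python
def browser_view(body: bytes, declared_charset: str) -> str:
    """Decode the response body the way a browser would for a declared charset."""
    return body.decode(declared_charset)

# The server sends the same UTF-8 bytes in both cases.
body = "It’s".encode("utf-8")

print(browser_view(body, "utf-8"))   # It’s
print(browser_view(body, "cp1252"))  # Itâ€™s
```

The bytes on the wire are identical in both calls; only the declared charset changes, which is why copying the database again does not help.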
So, I've finally ended up using a plug-in to solve the problem. Here are the steps I took:
Migrate the structure and content of the old database to the new database using Navicat for MySQL (though I think any method of copying will work).
Change the encoding of the columns in the wp_posts table to utf8 using ALTER TABLE `wp_posts` CHANGE COLUMN `post_content` `post_content` longtext CHARACTER SET utf8 NOT NULL AFTER `post_date_gmt`;
Use the ISO to UTF content plug-in to convert any non-encoded chars in the table to utf chars.

Croatian diacritic signs in MySQL db (utf-8)

Diacritic signs http://img98.imageshack.us/img98/3383/dijakritickiznakovi.gif
So, the symbols below the display title should be displayed that way.
UTF-8 entities are listed below the HTML (utf-8) title (here is the list: LINK)
And the last line shows what is stored in my database.
Collation of db table is utf8_unicode_ci.
I suppose that symbols in db shouldn't be as they are in my case?
They display correctly on the page when loaded from the database, but not all of them are shown by the utf-8 table from the given link. Even if I see them correctly, maybe someone else won't?
Setting the MySQL table charset is not enough. You should also take care to set the correct charset for the client, the connection and the results, whose defaults may differ from server to server, making your database less than portable: the same database content might be displayed differently after moving to another server.
I've been storing Slovenian text in MySQL for some time now and this is what works for me:
the first thing you do after connecting should be to issue a "SET NAMES utf8" query
make sure that the strings you're storing are utf-8 to start with: if you're taking them from a web page form, make sure the page is UTF-8
be careful which tools you use to browse/edit the database contents online: phpMyAdmin is definitely unsafe.
Hope this helps.
You appear to be trying to store HTML-encoded strings in your database. Don't do that, it will only break your ability to do string operations like searching reliably. You should be able to store raw UTF-8 encoded characters as bytes in your database.
You don't say what environment you're using to read the database or how you get the ‘incorrect’ string at the bottom (which is UTF-8 bytes read using ISO-8859-1 encoding). If they appear in your web page (and you're specifying UTF-8 in the headers and/or <meta> tag), you're presumably pretty much there.
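To show why HTML-encoded strings get in the way of searching, here is a small Python sketch. The place name and the variable names are assumptions picked for the example; the only library call used is html.unescape from the standard library.

```python
import html

# Hypothetical value stored with HTML numeric entities instead of raw characters.
stored_as_entities = "&#268;akovec"
# The same value as plain text, which is what the database should hold.
stored_as_text = html.unescape(stored_as_entities)

search_term = "Čak"

print(stored_as_text)                     # Čakovec
print(search_term in stored_as_entities)  # False
print(search_term in stored_as_text)      # True
print(len(stored_as_entities), len(stored_as_text))  # 12 7
```

The entity form also inflates the length, so length limits and sorting behave differently from what the visible text suggests. Escaping for HTML is better left to the moment the page is generated.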