How to convert MySQL encoding from utf8 to utf8mb4 in a Rails project

I have a Rails 3.2 project using MySQL 5.5.34 with utf8 encoding. I have found that with utf8 encoding MySQL cannot store the Unicode characters that represent emoji.
So is it OK for me to convert the whole database to the utf8mb4 encoding, which, from what I found on the web, can hold 4-byte Unicode characters, including emoji?
Is all the information I already have in the database covered by the utf8mb4 encoding? Will I face data loss if I do that?
Is there any way that Rails provides to do that?
Thanks a lot for helping.

Actually, you just need to migrate the columns you want to encode as utf8mb4:
execute("ALTER TABLE yourtablename MODIFY yourcolumnname TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_bin;")
Migrating the existing data itself might not be possible: MySQL's common utf8 charset stores at most 3 bytes per character, while utf8mb4 stores up to 4, so any 4-byte characters you tried to save earlier may already sit in your DB as corrupt data.
Furthermore, Rails 3.2 has an encoding issue within ActiveSupport's JSON encoding. If you plan to work with JSON and emoji, you will need to add a patch like the following (based on the solution in Rails 4: https://github.com/rails/rails/blob/4-0-stable/activesupport/lib/active_support/json/encoding.rb) or simply upgrade to Rails 4.
module ActiveSupport
  module JSON
    module Encoding
      class << self
        def escape(string)
          # Replace characters that cannot be represented in UTF-8 instead of
          # raising, then operate on the raw bytes
          if string.respond_to?(:force_encoding)
            string = string.encode(::Encoding::UTF_8, :undef => :replace).force_encoding(::Encoding::BINARY)
          end
          json = string.gsub(escape_regex) { |s| ESCAPED_CHARS[s] }
          json = %("#{json}")
          json.force_encoding(::Encoding::UTF_8) if json.respond_to?(:force_encoding)
          json
        end
      end
    end
  end
end

Related

When I store data from a servlet to a MySQL database, characters like "<" and ">" are stored in Unicode escape format like \u003c rather than the actual symbol

I'm trying to store data like "< hi >" in my test database, but it ends up stored as "\u003c hi \u003e". My database uses the utf8mb4 charset, but I also experimented with utf8, utf32, etc. It didn't work out.
The problem is not with the MySQL DB. The problem is with the gson/JSON library.
Use this code to solve it:
Gson gsonBuilder = new GsonBuilder().disableHtmlEscaping().create();

MySQL Multilingual Encoding | Error Code: 1366. Incorrect string value: '\xCE\x09DIS'

I am trying to set up a database to store string data that is in multiple languages and includes Chinese characters among many others.
Steps I have taken so far:
I have created a schema which uses utf8mb4 character set and utf8mb4_unicode_ci collation.
I have created a table which includes CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci; at the end of the CREATE statement.
I am attempting to LOAD DATA INFILE from a CSV file with CHARACTER SET utf8mb4 specified in the LOAD statement.
However, I am receiving the error: Error Code: 1366. Incorrect string value: '\xCE\x09DIS' for column 'company_name' at row 43630.
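For reference, a LOAD statement of the kind described in the steps above would look roughly like this (the file path, table name, and field/line options are assumptions, not taken from the question):
-- path, table name, and delimiters below are hypothetical
LOAD DATA INFILE '/path/to/companies.csv'
INTO TABLE companies
CHARACTER SET utf8mb4
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;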
Did it successfully parse 43629 rows? Then croak on that row? It may actually be garbage in the file.
Do you know what that company name should be? What does the rest of the line say?
Do you have another example? Remove that one line and run the LOAD again.
0xCE can be interpreted by any 1-byte charset, but not necessarily in a meaningful way.
0x09 is the "tab" character in virtually all charsets; is it reasonable to have a tab in a company name??
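If most rows did load (or once the bad line is removed), one way to see exactly which bytes landed in a suspect column is HEX(); a sketch, reusing the hypothetical table name from the example above:
-- companies is a placeholder table name; company_name comes from the error message
SELECT company_name, HEX(company_name)
FROM companies
WHERE company_name LIKE '%DIS%';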

MySQL JSON type returns garbled text when using Spring Data JPA

I have set MySQL's character set to utf8mb4, and it works fine for the varchar type: saving and reading Chinese characters works.
But when it comes to the JSON type, saving works fine, while reading the JSON as a string via spring-data-jpa returns garbled text.
I have tried the settings below; they don't work.
spring.datasource.url = jdbc:mysql://localhost:3306/TAIMIROBOT?useUnicode=yes&characterEncoding=UTF-8
spring.datasource.init-sql="SET NAMES utf8mb4 COLLATE utf8mb4_bin;"
This issue has been fixed: bugs.mysql.com/bug.php?id=80631
ResultSet.getString() sometimes returned garbled data for columns of the JSON data type. This was because JSON data is binary-encoded by MySQL using the utf8mb4 character set, but was decoded by Connector/J using the ISO-8859-1 character set.
The fix has been included in Connector/J 6.0.5; the entry for the 5.1.40 changelog has been carried over into the 6.0.5 changelog.
If you have the same problem, just update the connector version in your Maven pom file to 6.0.5 (if you are using Maven).
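Independent of the driver version, it can help to confirm what the server and the session have actually negotiated; these statements list the relevant variables, which should show utf8mb4 where you expect it:
SHOW VARIABLES LIKE 'character_set_%';
SHOW VARIABLES LIKE 'collation_%';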

Using iconv to convert mysqldump-ed databases

Trying to quickly convert a latin1 MySQL DB to utf8, I tried the following:
Dump the DB
run iconv -f latin1 -t utf8 on the resulting file
import into a fresh DB with UTF8 default encoding
This mostly works, except... some letters get converted wrong (an example: an uppercase accented 'U' becomes some garbled sequence starting with a question mark). Some conversion is taking place (od on a query result shows a two-byte sequence where the latin1 byte was), and the latin1 version is alright. While I have so far been unsystematic in isolating the problem (late night; under deadline; etc.), the weirdness of the issue kills me: why would it fail on some letters and not all? Client connection? Column charset? Why am I not getting any diagnostics? I'm stymied.
Sure, I can work on isolating the issue and its details, but thought that maybe somebody ran into this already and can recognize it by this (admittedly rather poor) description.
Cheers
The data may have been stored as latin1, but it's possible that whatever client you used to dump the data has already exported it as UTF-8.
Open the dump file in a decent text editor (Notepad++, TextWrangler, Atom) and check which encoding allows all characters to be displayed properly.
Then, when it comes to importing the data back in, ensure your client is set to use UTF-8 on the import.
Don't use iconv; it only muddies the waters.
Assuming that a table is declared to be latin1 and correctly contains latin1 bytes, but you would like to change it to utf8, do this to the table:
ALTER TABLE tbl CONVERT TO CHARACTER SET utf8;
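Before running that ALTER, it is worth confirming what the table is declared as and what bytes it really holds; a quick check (tbl and col are placeholders):
SHOW CREATE TABLE tbl;                   -- shows the declared charset/collation
SELECT col, HEX(col) FROM tbl LIMIT 10;  -- raw bytes: é is E9 in latin1, C3A9 in utf8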
It is also possible to do it with a dump and reload; it involves some changes to the arguments. Sorry I don't have the details.

Doctrine - load a YAML fixture with French characters

My Doctrine 1.2 is integrated inside CodeIgniter as a hook, and I know that my charset is utf8 with collation utf8_unicode_ci.
I have two YAML files, one for creating the DB and its tables and one to load some test data. My data can contain French accents (çéïë...). In my schema.yml I have correctly specified the collation and charset:
options:
  type: INNODB
  charset: utf8
  collate: utf8_unicode_ci
I double checked the settings in phpMyAdmin, everything is correct.
When I run my Doctrine script from the command line to load my fixture into one of my tables, all the French accents are replaced by junk!
Am I missing a setting or configuration or is there a bug in Doctrine?
In your /config/database.php you should have the Doctrine connection:
// Load the Doctrine connection
$doctrine = Doctrine_Manager::connection($db['default']['dsn'], $db['default']['database']);
To fix the problem with the encoding, you have to add this line:
$doctrine->exec('set names utf8');
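For what it's worth, SET NAMES utf8 simply tells the server which charset the client sends and expects; per the MySQL docs it is roughly equivalent to setting these session variables:
SET character_set_client = utf8;
SET character_set_results = utf8;
SET character_set_connection = utf8;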