utf8mb4 setting for talend - not working - mysql

I am migrating data from SQL Server to MySQL, using Talend (ETL) for the job.
The problem is that rows containing emojis in the source (SQL Server) do not get inserted into the MySQL table. I know I must use utf8mb4 on the MySQL side.
For the emojis to be inserted, the client character encoding has to be set as well. The database, the tables and the server are all on utf8mb4.
But the client, i.e. Talend, is not using utf8mb4. Where do I set this?
I tried 'set names utf8mb4' in the additional parameters of tMysqlOutput, but this does not work.
I have been stuck on this for days; any help would be greatly appreciated.
Update:
The job looks like this now, but the smileys are still exported as '?'.
Thanks
Rathi

First, make sure that your server is properly configured to use utf8mb4.
Following this tutorial, you need to add the following to your my.cnf (or my.ini if you're on Windows):
[client]
default-character-set = utf8mb4
[mysql]
default-character-set = utf8mb4
[mysqld]
character-set-client-handshake = FALSE
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci
This tells the MySQL server to use utf8mb4 and to ignore any encoding requested by the client.
After that, I didn't need to set any additional properties on the MySQL connection in Talend. I executed this query in Talend to check the encoding it ends up with:
SHOW VARIABLES
WHERE Variable_name LIKE 'character\\_set\\_%' OR Variable_name LIKE 'collation%'
And it returned:
|=-----------------------+-----------------=|
|Variable_Name           |Value             |
|=-----------------------+-----------------=|
|character_set_client    |utf8mb4           |
|character_set_connection|utf8mb4           |
|character_set_database  |utf8mb4           |
|character_set_filesystem|binary            |
|character_set_results   |                  |
|character_set_server    |utf8mb4           |
|character_set_system    |utf8              |
|collation_connection    |utf8mb4_unicode_ci|
|collation_database      |utf8mb4_unicode_ci|
|collation_server        |utf8mb4_unicode_ci|
'------------------------+------------------'
The following test to insert a pile of poop works:
Update
Using the native MySQL components in Talend 6.3.1, you get mysql-connector-java-5.1.30-bin.jar, which is supposed to detect the server's utf8mb4 automatically, but for some reason (a bug?) it isn't doing that.
I switched to the JDBC components and downloaded the latest MySQL connector (mysql-connector-java-5.1.45-bin.jar). I got it working by setting these additional parameters on the tJDBCConnection component:
useUnicode=true&characterEncoding=utf-8
(Even though I'm specifying utf-8, the doc says it will be treated as utf8mb4.)
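To show where those two parameters end up, here is a minimal Python sketch that assembles a JDBC URL with them appended. The host, port and database name are placeholders, and build_jdbc_url is a hypothetical helper for illustration only; it is not part of Talend or Connector/J.

```python
from urllib.parse import urlencode


def build_jdbc_url(host, port, database, params):
    """Return a MySQL JDBC URL with the given connection parameters appended."""
    base = f"jdbc:mysql://{host}:{port}/{database}"
    query = urlencode(params)  # joins key=value pairs with '&'
    return f"{base}?{query}" if query else base


# Placeholder connection details; only the two parameters come from the answer.
url = build_jdbc_url(
    "localhost",
    3306,
    "mydb",
    {"useUnicode": "true", "characterEncoding": "utf-8"},
)
print(url)
```

In Talend itself you would paste the resulting URL (or just the parameter string) into the connection component rather than build it in code.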
Here's what my job looks like now :

Related

Upgrade to MySQL 8. Unknown character set index for field '255' received from server. Exception

The MySQL 5 database used the latin1 character set with the latin1_general_ci collation, and all tables use this character set as well.
After migrating to MySQL 8 (dump from MySQL 5, then create and restore on MySQL 8), the error "Unknown character set index for field '255' received from server." occurs when connecting to the database.
Why does this happen? I suppose it may be related to the fact that MySQL 8 uses utf8mb4 as its default character set.
But utf8mb4 is wider than latin1, so a migration from latin1 (1 byte) to utf8mb4 (up to 4 bytes) should be supported, not the other way around.
Changing the database character set to latin1 and the collation to latin1_swedish_ci has no effect.
Here are some of the MySQL 8 parameters:
SHOW VARIABLES LIKE 'char%';
character_set_client utf8
character_set_connection utf8
character_set_database utf8mb4
character_set_filesystem binary
character_set_results utf8
character_set_server utf8mb4
character_set_system utf8
character_sets_dir C:\Program Files\MySQL\MySQL Server 8.0\share\charsets\
The connector driver used with MySQL 8 is:
mysql-connector-java v. 3.1.14
I had this issue when executing scripts from my Java application; it showed the same error as in the subject line.
The fix (in my case, since we had been using latin1) was to provide the character encoding in the connection URL:
?characterEncoding=latin1
Here is the list of encoding names that can be used between Java and MySQL. This is the official documentation of the Java MySQL connector jar.
The only solution for this problem is to download the latest version of the connector from the following website:
https://dev.mysql.com/downloads/connector/j/
On that page there is an option to select the operating system. If you use Windows, select the platform-independent option; a download option then appears. Download the zip file and extract it, then add the jar file to your project's libraries. That should solve the problem.
Hope this solution works for you.

Squeryl utf8mb4 support

I'm using Squeryl to work with a MySQL database. The tables use the utf8mb4 encoding. Now I want to insert some strings containing 4-byte UTF-8 characters into the db through Squeryl. How do I do that?
I tried adding ?useUnicode=yes&characterEncoding=UTF-8 to my connection URL, but apparently UTF-8 here means the 3-byte utf8 to MySQL, so it doesn't work.
I found this StackOverflow answer, but after some digging I don't see any way to prefix my queries with SET NAMES utf8mb4; (changing the database config and environment is not an option).
Example string: อลิซร้องเพลงตามเลยค่ะ😂😂😂
Error when trying to insert the string:
Exception in thread "main" org.squeryl.SquerylSQLException: Exception while executing statement : Incorrect string value
Be sure not to connect as root: init_connect is not executed for users that have the SUPER privilege.
Put this in my.cnf (in the [mysqld] section):
init_connect = SET NAMES utf8mb4

SET NAMES utf8mb4

We are using Dropwizard, JDBI, MySQL 5.6 and MySQL connector 5.1.32 with a pooled data source. To support emojis, the only way I have found is to run the query "SET NAMES utf8mb4" on the connection whenever a connection is obtained.
But under load we observe that this query takes a long time (around 222 ms). Is there any alternative to this query?
Things tried so far:
1. Tried setting charSet and characterEncoding on the JDBC connection URL
2. The columns in the table use the utf8mb4 encoding and the utf8mb4_unicode_ci collation
3. MySQL is on RDS; the character_set_server etc. variables have not yet been changed on RDS

ERROR 1115 (42000): Unknown character set: 'utf8mb4'

I have a MySQL dump, which I tried to restore with:
mysql -u"username" -p"password" --host="127.0.0.1" mysql_db < mysql_db
However, this threw an error:
ERROR 1115 (42000) at line 3231: Unknown character set: 'utf8mb4'
This is lines 3231-3233:
/*!50003 SET character_set_client = utf8mb4 */ ;
/*!50003 SET character_set_results = utf8mb4 */ ;
/*!50003 SET collation_connection = utf8mb4_general_ci */ ;
I am using MySQL 5.1.69. How can I solve this error?
Your version does not support that character set; I believe it was introduced in 5.5.3. You should upgrade your MySQL to the version that was used to export this file.
The error is then quite clear: the dump sets a character set that your MySQL version does not support, and therefore does not know about.
According to https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html :
utf8mb4 is a superset of utf8
so maybe there is a chance you can just change it to utf8, close your eyes and hope, but that depends on your data, and I'd not recommend it.
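If you want to check the version requirement before attempting a restore, a rough Python sketch could look like this. It assumes the 5.5.3 threshold mentioned above and a version string shaped like "5.1.69" or "5.5.32-log"; parse_version and supports_utf8mb4 are hypothetical helper names, not MySQL functions.

```python
# Version in which utf8mb4 was introduced, per the answer above.
UTF8MB4_MIN_VERSION = (5, 5, 3)


def parse_version(text):
    """Turn '5.1.69' or '5.5.32-log' into a tuple of ints such as (5, 1, 69)."""
    numeric = text.split("-", 1)[0]  # drop suffixes like '-log'
    return tuple(int(part) for part in numeric.split("."))


def supports_utf8mb4(server_version):
    """True if the given MySQL server version should know utf8mb4."""
    return parse_version(server_version) >= UTF8MB4_MIN_VERSION


print(supports_utf8mb4("5.1.69"))  # the version from the question
```

The version string itself can be read on the server with SELECT VERSION();.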
You can try:
Open the SQL file in a text editor, then find and replace all
utf8mb4 with utf8
Import again.
This can help:
mysqldump --compatible=mysql40 -u user -p DB > dumpfile.sql
phpMyAdmin has the same MySQL compatibility mode in the 'expert' export options, although on occasion that has done nothing.
If you don't have access via the command line or via phpMyAdmin, then editing the
/*!50003 SET character_set_client = utf8mb4 */ ;
line to read 'utf8' instead is the way to go.
I am answering this question because I didn't find any of the existing answers complete. Nowadays "Unknown character set: 'utf8mb4'" is quite prevalent, as a lot of deployments run MySQL older than 5.5.3 (the version in which utf8mb4 was added).
The error clearly states that utf8mb4 is not supported on your stage DB server.
Cause: locally you probably have MySQL 5.5.3 or later, while the stage/hosted VPS runs a MySQL server older than 5.5.3.
The utf8mb4 character set was added in MySQL 5.5.3.
utf8mb4 was added because of a bug in MySQL's utf8 character set.
MySQL's handling of the utf8 character set only allows a maximum of 3
bytes for a single codepoint, which isn't enough to represent the
entirety of Unicode (Maximum codepoint = 0x10FFFF). Because they
didn't want to potentially break any stuff that relied on this buggy
behaviour, utf8mb4 was added. Documentation here.
(Quoted from another SO answer.)
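To make the 3-byte limit concrete, here is a small Python sketch that reports whether a string contains characters that need 4 bytes in UTF-8, i.e. code points above U+FFFF. Those are the ones MySQL's utf8 cannot store. needs_utf8mb4 is a hypothetical helper name used only for this illustration.

```python
def needs_utf8mb4(text):
    """True if any character in text takes 4 bytes in UTF-8 (code point above U+FFFF)."""
    return any(ord(ch) > 0xFFFF for ch in text)


def utf8_byte_lengths(text):
    """Return (character, UTF-8 byte count) pairs for each character in text."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]


thai = "\u0e2d\u0e25\u0e34\u0e0b"   # Thai letters: 3 bytes each in UTF-8
emoji = "\U0001F602"                # an emoji: 4 bytes in UTF-8

print(needs_utf8mb4(thai), needs_utf8mb4(thai + emoji))
print(utf8_byte_lengths(emoji))
```

Running a check like this over a sample of the data tells you whether a downgrade to utf8 would lose anything.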
Verification:
To verify, check the current character set and collation of the DB you're exporting the dump from (see: How do I see what character set a MySQL database / table / column is?).
Solution 1: Simply upgrade your MySQL server to at least 5.5.3. Next time, be conscious of the versions you use locally, on stage and in prod; they should all be the same. A suggestion: nowadays the default character set should be utf8mb4.
Solution 2 (not recommended): Convert the current character set to utf8 and then export the data; it will load fine.
Just open your SQL file in a text editor, search for 'utf8mb4' and replace it with utf8. I hope that works for you.
Maybe the whole database, the tables and the fields should all have the same charset?
i.e.
CREATE TABLE `politicas` (
`ID` int(11) NOT NULL AUTO_INCREMENT,
`Nombre` varchar(250) CHARACTER SET utf8 NOT NULL,
-------------------------------------^here!!!!!!!!!!!
PRIMARY KEY (`ID`)
) ENGINE=MyISAM AUTO_INCREMENT=3 DEFAULT CHARSET=utf8;
-------------------------------------------------^here!!!!!!!!!
As some suggested here, replacing utf8mb4 with utf8 will help you resolve the issue. I used sed to find and replace the occurrences rather than editing by hand, to avoid damaging the data. In addition, opening a large file in a graphical editor can be painful; my MySQL data had grown to 2 GB. The final command is
sed -e 's/utf8mb4_unicode_520_ci/utf8_unicode_ci/g' -e 's/utf8mb4/utf8/g' original-mysql-data.sql > updated-mysql-data.sql
Done!
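If sed isn't available (e.g. on Windows), the same replacement can be sketched in Python, streaming the dump line by line so a multi-GB file never has to fit in memory. The file names are the ones from the command above; downgrade_line and downgrade_dump are hypothetical helper names. Note that this also rewrites any occurrence of "utf8mb4" inside the data itself, and it does nothing about 4-byte characters already present in the dump.

```python
# Order matters: the longer collation name must be handled before the bare "utf8mb4".
REPLACEMENTS = [
    ("utf8mb4_unicode_520_ci", "utf8_unicode_ci"),
    ("utf8mb4", "utf8"),
]


def downgrade_line(line):
    """Apply every replacement to one line of the dump."""
    for old, new in REPLACEMENTS:
        line = line.replace(old, new)
    return line


def downgrade_dump(src_path, dst_path):
    """Copy src_path to dst_path, rewriting charset names line by line."""
    # surrogateescape keeps bytes that are not valid UTF-8 intact on the way through;
    # newline="" leaves the original line endings untouched.
    with open(src_path, encoding="utf-8", errors="surrogateescape", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", errors="surrogateescape", newline="") as dst:
        for line in src:
            dst.write(downgrade_line(line))


if __name__ == "__main__":
    downgrade_dump("original-mysql-data.sql", "updated-mysql-data.sql")
```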
Open your SQL file in any editing tool,
find
/*!40101 SET NAMES utf8mb4 */;
and change it to
/*!40101 SET NAMES utf8 */;
Save and upload your SQL file.

Non-English characters are not shown correctly in a mysql database inside phpMyAdmin (xampp)

My problem:
I've been given a MySQL database in a non-English language (Persian, also called Farsi; if you don't know the language, its script is similar to Arabic). The records were entered through a PHP web interface on a Windows machine. When I view the database using phpMyAdmin in XAMPP, the records look like this:
مرکز آموزش توپخانه نزاجا
If I edit the records in phpMyAdmin, I can add non-English (Persian) characters and they look fine, only the existing data is incorrectly displayed.
I've been provided with a .sql backup file as well, but when I open it in Notepad++ it doesn't look right either. I also tried "Encode in UTF-8" in Notepad++, but no use.
What I want:
A correct representation in phpMyAdmin or a healthy .sql file.
What I have:
xampp 1.8.2 (Apache 2.4.4, MySQL 5.5.32, PHP 5.4.16, phpMyAdmin 4.0.4), win 7 x64
The files I have:
.frm .MYD .MYI files (which I copied to xampp\mysql\data\mxpro), the .sql file i mentioned (mxpro.sql) & db.opt file containing these 2 lines:
default-character-set=utf8
default-collation=utf8_general_ci
I've found this line included inside the .sql file:
CHARSET=latin1
All of these files are inside a folder called 'mxpro' located in xampp\mysql\data\.
The collation of the table columns in the phpMyAdmin are: latin1_swedish_ci
What I have tried:
First of all, when I open the MYI file in Notepad++ and use "Encode in UTF-8", I can see most of the data sitting there in the correct format (Persian).
I've tried the following based on my research:
1) Changing whatever I see to utf8_general_ci, including: database (mxpro) collation (through operations), table collation (through operations), columns collation & server connection collation (in general settings)
2) Changing these server variables to utf8: character set client, character set connection, character set database, character set results, character set server & character set system.
3) Changing these server variables to utf8_unicode_ci: collation connection, collation database & collation server.
4) Adding this line:
@MySQL_Query("SET NAMES utf8");
to xampp\php\pear\MDB2\Driver\mysql.php after this line:
$connection = @call_user_func_array($connect_function, $params);
5) Adding these 3 lines to my.ini:
collation_server=utf8_unicode_ci
character_set_server=utf8
skip-character-set-client-handshake
6) Adding these 2 lines:
mysqli_query($link, "SET SESSION CHARACTER_SET_RESULTS =latin1;");
mysqli_query($link, "SET SESSION CHARACTER_SET_CLIENT =latin1;");
to xampp\phpMyAdmin\libraries\dbi\mysql.dbi.lib.php, below the following line:
PMA_DBI_postConnect($link, $is_controluser);
7) Changing this:
'utf-8' => 'utf8',
to this:
'utf-8' => 'latin1',
in xampp\phpMyAdmin\libraries\select_lang.lib.php
Despite my efforts, there is no result yet.
Thank you in advance.
On the phpMyAdmin wiki, there is an article explaining this issue:
http://wiki.phpmyadmin.net/pma/Garbled_data
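The symptoms in the question (UTF-8 text visible in the raw files, latin1 columns, garbled display) suggest that UTF-8 bytes were stored and are now being read back as latin1. The following Python sketch only illustrates that mechanism and its reversal on a sample string; garble and repair are hypothetical helper names, and this is not the procedure from the wiki article.

```python
def garble(text):
    """Simulate the damage: UTF-8 bytes read back as if they were latin1."""
    return text.encode("utf-8").decode("latin-1")


def repair(garbled):
    """Reverse it: recover the original bytes, then decode them as UTF-8."""
    return garbled.encode("latin-1").decode("utf-8")


original = "\u0645\u0631\u06a9\u0632"  # a short Persian word
broken = garble(original)

print(len(original), len(broken))   # each Persian letter becomes two characters
print(repair(broken) == original)
```

One caveat: MySQL's latin1 is closer to Windows-1252 than to ISO 8859-1, so garbled text copied from a screen or a .sql file may need a different codec than the one in this sketch.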
I used this query on my DB and it worked perfectly:
ALTER DATABASE yourDB CHARACTER SET utf8 COLLATE utf8_bin