OpenJPA & MySQL persist wrong encoded characters - mysql

my mysql db has character encoding utf8. In QueryBrowser i can see special characters are correct. In appplication using openjpa i can see the same values also correct.
But when I persist object into DB, I have correct values in application but incorrect in DB!
When I restart application that special characters in application are incorrect.(as they are picked from DB)
All is set to UTF-8, java application works well, reading data from DB is correct but problem is when openjpa stores values in DB, they turn into '?'.
Any ideas? Thanks

Check your encoding on the MySql server configuration level (my.cnf file), and also on the level of a specific database. Once, I had similar issue when these two options has been set to different values (encodings).

Related

UTF-8 encoding problem while importing a sql file

I have a server hosting MySQL, PHPMyAdmin reports:
Server version: 5.1.56-community
MySQL charset: UTF-8 Unicode (utf8)
I export a sql from using either mysqldump -uroot -p database > file.dump or mysqldump -uroot -p database -r file.dump (both generated files are identical anyway).
Locally, I installed MySQL 5.5 and HeidiSQL 9.5.
As the server's SQL file my.ini has:
default-character-set=utf8
I changed the local my.ini file to have
default-character-set=utf8
But also:
character-set-server=utf8
They were both set to latin1. Dunno why I have character-set-server set here while the server does not. Anyway.
Now I start HeidiSQL, it shows utf8mb4 references instead of utf8 for the sessions parameters. I don't know why:
Now, I import my dumped file, and I see that even if everything is apparently configured in utf8, it looks like I have some encoding problems.
On the server, I see:
Locally, in HeidiSQL, I see:
Special characters like à are not displayed correctly on the local database.
Am I doing something wrong?
Note that if I install HeidiSQL on the server, the variable tab shows the same values for the Session and Global parameters, and the à is shown correctly.
So this may be the root cause of the problem, but I don't know how to fix it. If I change the Session values before importing the sql file it does not fix the issue, and also values are back to utf8mb4 when I start HeidiSQL again.
Thanks to deceze comment, I could fix the issue.
In HeidiSQL, when I choose the sql file to execute, there's actually an "ncoding" option I did not notice originally ;-)
If I keep "auto-detect", the import generates bad content (with mojibake characters)
If I force "UTF-8", the import is perfect
Dunno why HeidiSQL fails to auto-detect the encoding...
A few thoughts:
It looks like you have the character set set correctly. The fact that HeidiSQL displays a different character set, is probably because clients themselves set a character set.
For example, your mysql server might use "Character set A" by default. If a client connects and says they want "Character set B", the server will convert this on the fly.
utf8mb4 is a superset (and superior to) utf8. It's better to have your server default to utf8mb4. The popular usecase of utf8mb4 is emoji.
Anyway, the reason you are getting mojibake is probably unrelated to having these character sets set correctly.
What I think may have happened is as follows (this is a guess).
Your tables/columns were set as UTF-8.
A client connects and tells the server "I want to use ISO-8559-1/latin instead".
The server happily complies and will convert the clients ISO-8559-1 strings to UTF-8 on the fly.
Despite the client wanting to use ISO-8559-1, it actually sends UTF-8.
The server thinks the data is ISO-8559-1 and treats it as such, and converts the UTF-8 using a ISO-8559-1 to UTF. It's effectively a double-encoding.
If I'm right, it means that you can have all your columns, connections and tables set to UTF-8, but your data is simply bad.
If this is correct, this process is reversable
You really just need the opposite operation. For example, if you had a PHP string $data, which is 'double-encoded' as UTF-8, the process would simply be to call this:
$output = utf8_decode($input)
It's also possible to fix this in MySQL. See this stack overflow question.
A few things to be aware of:
Make sure this is actually the case. Are you getting the correct output after this operation?
Make backups, obviously.
Also make absolutely sure that whatever was writing double-encoded UTF-8 to your database is now fixed. The last thing you want is a table that's a mixture of different encodings.
Sidenote: This problem is extremely common. You are somewhat lucky that you're french because it highlights the problem. Many english systems I've seen have this issue but it largely goes unnoticed for a long time because a lot of text doesn't go outside the common ASCII range.
You have "Mojibake". à turns into à (there are two characters, the second is a space).
This is caused when latin1 is involved somewhere in the process. The SESSION and GLOBAL settings are not at fault. Let's see SHOW CREATE TABLE.
See Mojibake in Trouble with UTF-8 characters; what I see is not what I stored for the likely causes. It may involve "Double Encoding"; let's see SELECT col, HEX(col) ....
As for fixing the data -- It depends on whether you have simply Mojibake or Double Encoding. See http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases for both.

Querying mysql database with SQLDeveloper doesn't return correct values

I have a mysql database with a charset utf8 of all the tables.
I am using SQLDeveloper to access and query the database with the latest JConnector JDBC driver.
When executing a simple query such as SELECT 'Варна'; equivalent to SELECT 'Варна' from DUAL;, which contains Bulgarian language, SQLDeveloper returns '?????'. This makes selects from the database in which I have used Bulgarian language return NULL, because their where clauses (containing Bulgarian language) mismatch the uft8 Bulgarian characters in the database. (When the select doesn't use Bulgarian language at all SQLDeveloper returns completely correct values and displays the Bulgarian language returned as a result of the query correctly.)
The Preferences -> Environment -> Encoding in SQLDeveloper is set currently to UTF-8, but I have tried virtually every applicable encoding listed in there and even the simplest query SELECT 'Варна' from DUAL; still does not return back the correct value Варна.
I have looked into setting the variable NLS_LANG, thinking this may be the cause but to no avail. (Perhaps it is the key after all but I am unable to actually configure it properly).
Edit: In order to reproduce the problem and visualise it (as I realise I may have explained it poorly) just go in SQLDeveloper and connect to a mysql database and execute the query SELECT 'Варна' from DUAL;.
Edit2: Clarifications.
Edit3: As shown by the comment left by #tenhouse it appears that this may be a bug.
Edit4: As stated below as a comment, the above query SELECT 'Варна' from DUAL; works perfectly fine without any modifications and/or settings fiddling on MySQL Workbench.
Edit5: Please, feel free to correct the title and/or tags if you feel that something can be improved as there is still no answer to the problem.
Edit6: By now can I assume that it really is a bug? Could anyone advise me where exactly to report it - is it a JConnector or SQLDeveloper related bug. I would think that I have to report it as a SQLDeveloper bug but I'd rather get a confirmation before possibly wasting their time.
Edit7: Tried to clarify it even further in my hopes for an answer.
Edit8: (Important!) My current database is hosted on linux (Ubuntu 12.04, MySQL 5.5.28) server. If, however, I install MySQL on a fresh Windows machine and create a utf8 db there, querying through SQLDeveloper works as it is supposed to, SELECT 'Варна' from DUAL; actually returns Варна. Could someone please confirm this?
So I didn't know this myself till having this problem a few months back, but MySQL actually offers the ability for different encodings for clients, databases, and connections. MySQL will convert (or collate) the requests/responses from/to a client to different encodings as specified by the client or in his config file. So even though the database is storing stuff as utf8, if the client is set to latin1, your going to see latin1 as your result encoding. The easiest way to check this is to spin up a connection to MySQL and run the following query:
SHOW VARIABLES LIKE "%char%";
You should see a whole bunch of encodings for different connections/sources. From your description, I imagine most of these will not be utf8. Here's mysql's doc on what each of these mean. You can test if this in fact the problem by doing a SET NAMES 'utf8'; or charset utf8; (can't remember which one) and running your queries again to see if that fixes the problem.
A summary of what each of these guys does (since the docs leave some stuff out):
character_set_client: specifies how data is encoded when sending from client to server. Anything connecting through MySQL's API is not a client (ex. php's mysqli, most C/C++ wrapper libs)
character_set_database: specifies the encoding for data stored in the database
character_set_filesystem: not really sure, but I believe how data is written to disk?
character_set_results: the encoding that MySQL returns query results
character_set_server: server's default set (not really sure cases where this is used)
character_set_system: not sure on this one either
character_sets_dir: where your collation/encoding definitions are located
Most of these guys can be specified by editing your my.cnf and sticking your defaults in there.
I'm not exactly sure how JConnector works, but I imagine it uses MySQL's C API, in which case you'll need to do something like the following somewhere in the code. Maybe JConnector has a way for you to set this through him. I'm not sure, but here's the syntax for the MySQL API:
mysql_options( myLink, MYSQL_SET_CHARSET_NAME, "utf8" );
EDIT: For MySQL 5.5
You can try a command like this ::
ALTER DATABASE CHARACTER SET WE8ISO8859P5;
Please restart the DB after changing the characterset.
More details refer this link where it explains about the encoding required for different languages
http://www.csee.umbc.edu/portal/help/oracle8/server.815/a67789/ch3.htm
after you connect with a mysql_connect:
$dbcnx = mysql_connect($dbhost, $dbuser, $dbpass)
you do this query:
mysql_query("SET
character_set_results = 'utf8',
character_set_client = 'utf8',
character_set_connection = 'utf8',
character_set_database = 'utf8',
character_set_server = 'utf8'",
$dbcnx);
Now this will set the encoding for what is returned, what happens on the server - so all of it has the same encoding.
In your following query's, you specify this connection to be used
Export
Add [?characterEncoding=utf8]
<StringRefAddr addrType="customUrl">
<Contents>jdbc:mysql://instance_host_name:3306/database_name?characterEncoding=utf8</Contents>
</StringRefAddr>
Import

Databases: column encoding, when is it important?

We are importing data from .sql script containing UTF-8 encoded data to MySQL database:
mysql ... database_name < script.sql
Later this data is being displayed on page in our web application (connected to that database), again in UTF-8. But somewhere in the process something went wrong, because non-ascii characters was displayed incorrectly.
Our first attempt to solve it was to change mysql columns encoding to UTF-8 (as described for example here):
alter table wp_posts change post_content post_content LONGBLOB;`
alter table wp_posts change post_content post_content LONGTEXT CHARACTER SET utf8;
But it didn't helped.
Finally we solved this problem by importing data from .sql script with additional command line flag which as I believe forced mysql client to treat data from .sql script as UTF-8.
mysql ... --default-character-set=utf8 database_name < script.sql
It helped but then we realized that this time we forgot to change column encoding to utf8 - it was set to latin1 even if utf-8 encoded data was flowing through database (from sql script to application).
So if data obtained from database is displayed correctly even if database character set is set incorrectly, then why the heck should I bother setting correct database encoding?
Especially I would like to know:
What parts of database rely on column encoding setting? When this setting has any real meaning?
On what occasions implicit conversion of column encoding is done?
How does trick with converting column to binary format and then to the destination encoding work (see: sql code snippet above)? I still don't get it.
Hope someone help me to clear things up...
The biggest reason, in my view, is that it breaks your DB consistency.
it happens way to often that you need to check data in the database. And if you cannot properly input UTF-8 strings coming from the web page to your MySQL CLI client, it's a pity;
if you need to use phpMyAdmin to administer your database through the “correct” web, then you're limiting yourself (might not be an issue though);
if you need to build a report on your data, then you're greatly limited by the number of possible choices, given only web is producing your the correct output;
if you need to deliver a partial database extract to your partner or external company for analysis, and extract is messed up — it's a pity.
Now to your questions:
When you ask database to ORDER BY some column of string data type, then sorting rules takes into account the encoding of your column, as some internal trasformation are applicable in case you have different encodings for different columns. Same applies if you're trying to compare strings, encoding information is essential here. Encoding comes together with collation, although most people don't use this feature so often.
As mentioned, if you have any set of columns in different encodings, database will choose to implicitly convert values to a common encoding, which is UTF8 nowadays. Strings' implicit encoding might be done in the client frameworks/libraries, depending on the client's environment encoding. Typically data is recoded into the database's encoding when sent to the server and back into client's encoding when results are delivered.
Binary data has no notion of encoding, it's just a set of bytes. So when you convert to binary, you're telling database to “forget” encoding, although you keep data without changes. Later, you convert to the string enforcing the right encoding. This trick helps if you're sure that data physically is in UTF-8, while by some accident a different encoding was specified.
Given that you've managed to load in data into the database by using --default-character-set=utf8 then there was something to do with your environment, I suggest it was not UTF8 setup.
I think the best practice today would be to:
have all your environments being UTF8 ready, including shells;
have all your databases defaulting to UTF8 encoding.
This way you'll have less field for errors.

Getting incorrectly encoded characters when retrieving values from MySQL DB

I'm encoding all my files in UTF-8 without BOM.
All characters appear correctly when displaying my page, yet values retrieved from my MySQL database such as Árbol render Ás and nont ANSI characters incorrectly (as a black romboid). I'm storing values in database with PHP My Admin manually.
What can be causing this issue with the database? Are there any changes I must do to make the database "utf8-ready"?
After you connect to the server, use the following command to view your current client characterset:
status;
To set it, use the following command:
set names utf8;

Managing Unicode Data in MySQL and VB.NET

I wish to develop a client-server application in VB.NET. I want to store some fields in
Unicode. As per MySQL documentation I tried the fields with varchar and charset UTF-8 for storing Unicode data.
I could insert data using the MySQL connector command object but when I try to display data in datagridview some junk is appearing.
What am I missing?
I don't know VB.NET, but you should have the possibility to set the encoding of the database connection from your application to MySQL when setting up to connection. Is that part set to UTF-8 as well?
Alternatively you can try issuing the following MySQL command after you connect:
SET NAMES utf8