MySQL encoding issue - mysql

For some reason, my MySQL table is converting single and double quotes into strange characters. For example,
"aha"
is changed into:
â€œahaâ€
How can I fix this, or detect it in PHP and decode everything?

The encodings used by your MySQL client and your server don't match. Use SET NAMES to match the character set of the connection to the one used in your PHP files.

It seems that the UTF-8 encoded string “aha” (binary 0xE2809C 0x61 0x68 0x61 0xE2809D) is interpreted as Windows-1252. In that encoding, this byte sequence represents the character sequence â€œahaâ€ (the final byte, 0x9D, has no Windows-1252 character).
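To see the effect outside MySQL, here is a minimal Python sketch of that misreading. It assumes Python's cp1252 codec is a fair stand-in for Windows-1252; the variable names are only for illustration.

```python
# Simulate the mix-up: text encoded as UTF-8 but decoded as Windows-1252.
original = "\u201caha\u201d"                 # “aha” with curly quotes

utf8_bytes = original.encode("utf-8")
print(utf8_bytes.hex(" "))                   # e2 80 9c 61 68 61 e2 80 9d

# Byte 0x9D has no character in cp1252, so substitute U+FFFD instead of failing.
misread = utf8_bytes.decode("cp1252", errors="replace")

# Reversing the mistake recovers the text as long as no byte was lost on the way.
garbled_apostrophe = "\u00e2\u20ac\u2122"    # â€™
repaired = garbled_apostrophe.encode("cp1252").decode("utf-8")   # ’
```

The reverse step only helps with data that is already damaged. Matching the connection character set is still the real fix.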

Related

How can I find out which character a certain string is encoding in my database?

Recently I exported parts of my MySQL database, and noticed that the text had several strange characters in it. For example, the string â€™ often appeared.
When trying to find out what this meant, I found the Stack Overflow question: Character Encoding and the â€™ Issue. From that question I now know that the string â€™ stands for a quote.
But how can I find out more generally what a string of characters stands for? For example, the letter Â often appears in my database as well, and is actually causing me a problem now on a certain page, and to solve the problem, I would like to know what that character means.
I've looked at several tables showing character encoding, but haven't been able to figure out how to use these tables to see why â€™ means ', or, more importantly for me, what Â stands for. I'd be very grateful if someone could point me in the right direction.
The latin1 encoding for â€™ is (in hex) E28099.
The utf8 encoding for ’ is E28099.
But you pasted in C3A2E282ACE284A2, which is the "double encoding" of that apostrophe.
What apparently happened is that you had ’ in the client; the client was generating utf8 encodings. But your connection parameters to MySQL said "latin1". So, your INSERT statement dutifully treated it as 3 latin1 characters E2 80 99 (visually â€™), and converted each one to utf8, hex C3A2 E282AC E284A2.
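The same chain can be reproduced in a few lines of Python. This is only a sketch: cp1252 is used here as a stand-in for MySQL's latin1, and the variable names are made up.

```python
# Sketch of the double encoding described above.
apostrophe = "\u2019"                            # ’

client_bytes = apostrophe.encode("utf-8")        # what the client actually sent
print(client_bytes.hex())                        # e28099

# The connection said "latin1", so the server reads three separate characters.
as_latin1_chars = client_bytes.decode("cp1252")  # â€™

# Storing those three characters in a utf8 column encodes each one again.
stored_bytes = as_latin1_chars.encode("utf-8")
print(stored_bytes.hex())                        # c3a2e282ace284a2
```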
Read about "double encoding" in Trouble with UTF-8 characters; what I see is not what I stored.
Meanwhile, browsers tend to be forgiving about double-encoding, or else it might have shown â€™.
latin1 characters are each 1 byte (2 hex digits). utf8/utf8mb4 characters are 1-to-4 bytes; some 2-byte and 3-byte encodings showed up in your exercise.
As for Â... Go to http://mysql.rjweb.org/doc.php/charcoll#8_bit_encodings and look at the second table there. Notice how the first two columns have lots of things starting with Â. In latin1, that is hex C2. In utf8, many punctuation marks are encoded as 2 bytes: C2xx. For example, the copyright symbol, © is utf8 hex C2A9, which is misinterpreted as Â©.
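If you would rather inspect a suspicious string yourself than read it off a table, something like the following Python sketch may help. Both helper names are hypothetical, and cp1252 is again assumed as the stand-in for latin1.

```python
import unicodedata

def describe(text):
    """Return (character, code point, UTF-8 hex, Unicode name) for each character."""
    return [
        (ch, f"U+{ord(ch):04X}", ch.encode("utf-8").hex(), unicodedata.name(ch, "?"))
        for ch in text
    ]

def undo_misreading(text):
    """Reverse one round of 'UTF-8 bytes read as cp1252' (hypothetical helper)."""
    return text.encode("cp1252").decode("utf-8")

for row in describe("\u00c2\u00a9"):         # Â©
    print(row[1:])

fixed = undo_misreading("\u00c2\u00a9")      # ©
```

undo_misreading raises an error on text that was never misread this way, which is a useful signal in itself.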

CGI script having trouble sending emojis characters from database

I am storing emojis in a MySQL database expressed in UTF8 Bytes, like "\xf0\x9f\x98\x80", which is the Unicode character U+1F600 GRINNING FACE
It is fine if I copy and paste it in and test it like this
print MAIL "Subject: \xf0\x9f\x98\x80\n";
It works and sends me the emoji.
But if I tell the script to get it from the database and plug it in like this:
print MAIL "Subject: $subject\n";
It will give me the subject: \xf0\x9f\x98\x80
What do I need to do? I thought if I was storing it in bytes it would see it as plain text and it would work.
It seems most likely that you have added the value to the database wrongly.
If you use Perl code and write the string '\xf0\x9f\x98\x80' to the database (note the single quotes) then you will get exactly the symptoms you describe. Your database will contain the sixteen-character ASCII string \xf0\x9f\x98\x80 and it will be displayed as such.
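The difference between the escaped text and the real bytes is not specific to Perl. Here is a small Python sketch of the same situation (raw strings play the role of Perl's single quotes), including one possible way to repair rows that were already stored as escaped text. The variable names are illustrative.

```python
# The stored value is text that spells out the escapes, not the bytes themselves.
stored_text = r"\xf0\x9f\x98\x80"       # raw string: backslashes are kept literally
real_bytes = b"\xf0\x9f\x98\x80"        # four actual bytes

print(len(stored_text))                 # 16
print(len(real_bytes))                  # 4

emoji = real_bytes.decode("utf-8")      # U+1F600 GRINNING FACE
print(f"U+{ord(emoji):X}")              # U+1F600

# Possible repair: turn the escapes into bytes, then decode those bytes as UTF-8.
recovered = (
    stored_text.encode("ascii")
    .decode("unicode_escape")
    .encode("latin-1")
    .decode("utf-8")
)
```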
You shouldn't be involved with the UTF-8 encoded bytes; it is far better to specify the Unicode code point either by name or number
All of these produce the same Perl UTF-8-encoded string
$s = "\N{U+1F600}";
$s = "\N{GRINNING FACE}";
$s = "\x{1F600}";
The corresponding encoded bytes are irrelevant to the programmer, but if you must you can use the Encode module like this
use Encode 'decode_utf8';
$s = decode_utf8 "\xf0\x9f\x98\x80";
Another way is to enter the character directly into your code. You will need use utf8 to indicate to the compiler that the source contains non-ASCII UTF-8-encoded characters, like this
use utf8;
$s = "😀";
All of these assignments to $s will produce exactly the same result, and the values will compare as being equal using eq
On the database side you need a MySQL column with a four-byte UTF-8 character set, for instance
column VARCHAR(50) CHARACTER SET utf8mb4
Note that the character set must be utf8mb4: the earlier utf8 is restricted to three-byte encodings, whereas this emoji, like most, needs four bytes
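A quick way to check whether a given string would fit the older three-byte utf8 is to count bytes per character. This Python sketch uses a hypothetical helper name and only models the byte-length limit, nothing else about MySQL.

```python
def fits_mysql_utf8(text):
    """True if every character needs at most three UTF-8 bytes (the legacy utf8 limit)."""
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)

print(len("\U0001F600".encode("utf-8")))    # 4
print(fits_mysql_utf8("\U0001F600"))        # False: this one needs utf8mb4
print(fits_mysql_utf8("\u2019"))            # True: ’ takes three bytes
```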

MySQL "binary" vs "char character set binary"

What's the difference between binary(10) vs char(10) character set binary?
And varbinary(10) vs varchar(10) character set binary?
Are they synonymous in all MySQL engines?
Is there any gotcha to watch out for?
There isn't a difference.
However, there is a caveat if you're storing a string.
If you only want to store a byte array or other binary data such as a stream or file then use the binary type as that is what they are meant for.
Quote from the MySQL manual:
The use of CHARACTER SET binary in the definition of a CHAR, VARCHAR,
or TEXT column causes the column to be treated as a binary data type.
For example, the following pairs of definitions are equivalent:
CHAR(10) CHARACTER SET binary
BINARY(10)
VARCHAR(10) CHARACTER SET binary
VARBINARY(10)
TEXT CHARACTER SET binary
BLOB
So, technically there is no difference.
However, when storing a string it must be converted from a string to byte values using a character set. You either do this yourself before the data reaches the MySQL server, or you leave it to MySQL to do for you. MySQL performs this by casting the string to BINARY using the binary character set.
If you want to store the text in another encoding, let's say you have a business requirement that says you must use 4 bytes per character (MySQL doesn't do this by default), you could apply CHARACTER SET BINARY to a textual column and perform the character set encoding yourself.
It is also worth reading The BINARY and VARBINARY Types from the MySQL manual as this details important information such as padding.
Summary:
There is no technical difference as one is a synonym for the other. In my opinion it makes logical sense to store binary strings in data types that would normally hold a string, using CHARACTER SET BINARY, and to store byte arrays, streams and other data that cannot be represented by transforming it through a character set in BINARY fields.
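For the "encode it yourself" case, this is roughly what the application side could look like. It is a Python sketch that assumes UTF-32 (big-endian, no BOM) as the four-bytes-per-character encoding; the requirement itself does not name an encoding.

```python
text = "h\u00e9llo"                     # héllo

utf8 = text.encode("utf-8")             # variable width
utf32 = text.encode("utf-32-be")        # fixed 4 bytes per character, no BOM

print(len(text), len(utf8), len(utf32)) # 5 6 20

# The application has to remember the encoding in order to read the value back.
assert utf32.decode("utf-32-be") == text
```

The column length then has to be sized in bytes: 10 characters need 40 bytes in this scheme.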

confused by html5, utf-8 and 8859-1

Yesterday I upgraded an html page from "4.01 strict" to html5.
* http://r0k.us/rock/games/CoH/HallsOfHeroes/
The character encoding is iso-8859-1. The http://validator.w3.org fails and won't even parse it when utf-8 is specified as charset, apparently because I use footnote characters such as ² . They are in the upper 128 bytes of the character set. What confuses me is that I keep reading that the first 256 bytes of utf-8 is 8859-1.
Does anyone know why the page won't validate as utf-8 ?
Actually, only the first 128 code points are encoded in UTF-8 exactly as in ASCII (and 8859-1). The next 128 code points have the same numbers as in 8859-1, but UTF-8 encodes each of them as two bytes, so the byte sequences differ.
You need to re-save the files as UTF-8 if you want them to be served as UTF-8.
The character ² ("SUPERSCRIPT TWO") is represented by the number 0xb2 (178 decimal) -- but it's represented differently in 8859-1 and UTF-8.
In 8859-1, it's represented as a single byte with the value 0xb2.
In UTF-8, it's represented as two consecutive bytes with the values 0xc2, 0xb2. See here for an explanation of the encoding.
(8859-1 is more compact than UTF-8 for files containing 8-bit characters, but it's incapable of representing anything past 255. UTF-8 is compatible with ASCII and with 8859-1 for 7-bit characters, is reasonably compact for most text, and can represent more than a million distinct characters.)
A file containing only 7-bit characters can be interpreted either as ASCII, 8859-1, or UTF-8. A file containing 8-bit characters cannot; it has to be translated.
If you're on a Unix-like system with the iconv command installed, this:
iconv -f iso-8859-1 -t utf-8
will perform the appropriate translation.
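If iconv isn't available, the same check and conversion can be sketched in Python. The function name is made up, and it works on a bytes object rather than on a file.

```python
superscript_two = "\u00b2"                          # ²

print(superscript_two.encode("iso-8859-1").hex())   # b2
print(superscript_two.encode("utf-8").hex())        # c2b2

# A lone 0xB2 byte is not valid UTF-8, which is why the mislabelled page fails.
try:
    b"\xb2".decode("utf-8")
    valid_as_utf8 = True
except UnicodeDecodeError:
    valid_as_utf8 = False

def latin1_to_utf8(data):
    """Same translation as `iconv -f iso-8859-1 -t utf-8`, applied to bytes."""
    return data.decode("iso-8859-1").encode("utf-8")

converted = latin1_to_utf8(b"x\xb2")                # b"x\xc2\xb2"
```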

Deciphering MySQL Encoding

I'm having an issue with encoding in MySQL, and I need some help in figuring out what's going on.
First, some parameters. The default encoding of the table is utf8. The character_set_client, character_set_connection, collation_connection, and character_set_server MySQL system variables, though, are all latin1.
I ssh into my MySQL server and I connect to the local server using the local command line client. I select record/column and the string that's returned, let's say the character comes back as A, which is correct. A is represented by hex in UTF-8 as "C5 9F."
However, the PHP app that hits the server interprets it as XY. In the MySQL commandline client, if I send the command "SET NAMES utf8", it will also now display it as XY.
If I do a select INTO OUTFILE and use hexedit to edit the file, I see two hex characters that map to X, then two hex characters that map to Y. ("c3 85" for X and "C5 B8" for Y). Basically, it's taking the two hex values and displaying them indeed as UTF8 characters.
First and foremost, it looks like the database is indeed storing things as UTF8, but the wrong kind of UTF8, correct? Are they going in as raw Unicode, but somehow, maybe because of the system variables, it is not being translated to UTF8?
Second, how/why is the MySQL command line client correctly interpreting XY as A?
Finally, to the successful interpretation of the MySQL command line, is there a chart that shows how C3 85 C5 B8 is getting converted to A, or XY is getting converted to A?
Thanks a bunch for any insight.
Your question is kind of confusing, so I'll explain with an example of my own:
You connect to the database without issuing SET NAMES, so the connection is set to Latin-1. That means the database expects any communication between you and it to be encoded in Latin-1.
You send the bytes C3A2 to the database, which you want to mean "â" in the UTF-8 encoding.
The database, expecting Latin-1, interprets this as the two characters "Ã¢" (C3 and A2 in the Latin-1 encoding).
The database will store these two characters internally in whatever encoding the table is set to.
You connect to the database in a different fashion, running SET NAMES utf8. The database now expects to talk to you in UTF-8.
You query the data stored in the database, and you receive the characters "Ã¢" encoded in UTF-8 as C383 C2A2, because you told the database to store the characters "Ã¢" and you are now querying them over a UTF-8 connection.
If you connected to the database again using Latin-1 for the connection, the database would give you the characters "Ã¢" encoded in Latin-1, which are the bytes C3 A2. If the client that you used to connect is interpreting that in Latin-1, you'll see the characters "Ã¢". If the client is interpreting that as UTF-8, you'll see the character "â".
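The steps above can be modelled in Python without a database. This is only a sketch of the conversions, with ISO-8859-1 standing in for the Latin-1 connection and made-up variable names.

```python
# Step-by-step model of the example above.
sent = "\u00e2".encode("utf-8")                 # client sends â as UTF-8
print(sent.hex())                               # c3a2

stored = sent.decode("iso-8859-1")              # server reads two characters: Ã¢

over_utf8 = stored.encode("utf-8")              # later query over a UTF-8 connection
print(over_utf8.hex())                          # c383c2a2

over_latin1 = stored.encode("iso-8859-1")       # same query over a Latin-1 connection
print(over_latin1.hex())                        # c3a2

seen_by_utf8_client = over_latin1.decode("utf-8")   # â again
```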
Essentially these are the points at which something can screw up:
the database will interpret any bytes it receives as characters in whatever encoding is set for the connection and convert the encoding of these characters to match the table they're supposed to be stored in
the database will convert the encoding of any characters from the encoding they're stored in into the encoding of the connection when retrieving data
the client may or may not interpret the bytes it receives from the database into the right characters to display on screen; command line environments in particular aren't always set up to display UTF-8 data correctly
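The bytes quoted in the question appear to fit this model. A small Python sketch, assuming cp1252 behaves like MySQL's latin1 for these two bytes:

```python
client_bytes = bytes.fromhex("c59f")             # the UTF-8 bytes quoted in the question
wanted = client_bytes.decode("utf-8")
print(f"U+{ord(wanted):04X}")                    # U+015F

# Read over a latin1 connection, the same bytes become two characters ...
misread = client_bytes.decode("cp1252")          # U+00C5 and U+0178
# ... which a utf8 table then stores as four bytes.
print(misread.encode("utf-8").hex(" "))          # c3 85 c5 b8
```

Those four bytes are the ones reported from the INTO OUTFILE dump, which suggests the rows were written over a latin1 connection by a client that was really sending UTF-8.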
Hope that helps.