Some unicode icons are not saved and displayed correct - html

I'am trying to add unicode icons to my website SEO title/meta and for some reason it will not accept some icons. My site is UTF-8. Im saving it in my database as utf8_general_ci.
When i add the icon 💲 it will return as ????
https://emojipedia.org/heavy-dollar-sign/
When I add the icon ✔️ it will add the ✔️ in the title.
https://emojipedia.org/check-mark/
Is there an reason for this or is this normal?

"✔️" is a 3-byte Emoji; "💲" takes 4 bytes. So the problem is that CHARACTER SET utf8 needs to be changed to CHARACTER SET utf8mb4.
The solution is to either
Provide utf8mb4 in the connection parameters`. This action varies with the client (Java/PHP/...).
Add SET NAMES utf8mb4 to the code right after connecting.
If these are not specific enough, search for where you have utf8 and it could be utf8mb4. (Note: I am not saying UTF-8, which is the non-MySQL equivalent of utf8mb4.)
More discussion: Trouble with UTF-8 characters; what I see is not what I stored
Technically the checkmark takes 6 bytes -- two 3-byte characters: hex E29C94 EFB88F. So it worked fine with utf8. However the dollar sign needs 4 bytes: F09F92B2, so it could not be represented in utf8, only in utf8mb4. The failure was shown via 4 question marks.

Related

How can I find out which character a certain string is encoding in my database?

Recently I exported parts of my mySQL database, and noticed that the text had several strange characters in it. For example, the string ’ often appeared.
When trying to find out what this meant, I found the stackoverflow question: Character Encoding and the ’ Issue. From that question I now know that the string ’ stands for a quote.
But how can I find out more generally what a string of characters stands for? For example, the letter  often appears in my database as well, and is actually causing me a problem now on a certain page, and to solve the problem, I would like to know what that character means.
I've looked at several tables showing character encoding, but haven't been able to figure out how to use these tables to see why ’ means ', or, more importantly for me, what  stands for. I'd be very grateful if someone could point me in the right direction.
The latin1 encoding for ’ is (in hex) E28099.
The utf8 encoding for ’ is E28099.
But you pasted in C3A2E282ACE284A2, which is the "double encoding" of that apostrophe.
What apparently happened is that you had ’ in the client; the client was generating utf8 encodings. But your connection parameters to MySQL said "latin1". So, your INSERT statement dutifully treated it as 3 latin1 characters E2 80 99 (visually ’), and converted each one to utf8, hex C3A2 E282AC E284A2.
Read about "double encoding" in Trouble with UTF-8 characters; what I see is not what I stored
Meanwhile, browsers tend to be forgiving about double-encoding, or else it might have shown ’
latin1 characters are each 1 byte (2 hex digits). utf8/utf8mb4 characters are 1-to-4 bytes; some 2-byte and 3-byte encodings showed up in your exercise.
As for Â... Go to http://mysql.rjweb.org/doc.php/charcoll#8_bit_encodings and look at the second table there. Notice how the first two columns have lots of things starting with Â. In latin1, that is hex C2. In utf8, many punctuation marks are encoded as 2 bytes: C2xx. For example, the copyright symbol, © is utf8 hex C2A9, which is misinterpreted ©.

Excel CSV String Not Fully Uploading To Excel

I have this string in Excel (I've UTF encoded It) when I save as CSV and import to MySql I get only the below, I know it's probably a charset issue but could you explain why as I'm having difficulty understanding it.
In Excel Cell:
PARTY HARD PAYDAY SPECIAL â UPTO £40 OFF EVENT PACKAGES INCLUDING HOTTEST EVENTS! MUST END SUNDAY! http://bit.ly/1Gzrw9H
Ends up in DB:
PARTY HARD PAYDAY SPECIAL
The field is structured to be utf8_general_ci encoded and VARCHAR(10000)
Mysql does not support full unicode utf8. There are some 4 byte characters that cannot be processed and, I guess, stored properly in regular utf8. I am assuming that upon import it is truncating the value after SPECIAL since mysql does not know how to process or store the character in the string that comes after that.
In order to handle full utf8 with 4 byte characters you will have to switch over to the utf8mb4.
This is from the mysql documentation:
The character set named utf8 uses a maximum of three bytes per character and contains only BMP characters. The utf8mb4 character set uses a maximum of four bytes per character supports supplemental characters...
You can read more here #dev.mysql
Also, Here is a great detailed explanation on reg-utf8 issues in mysql and how to switch to utf8mb4.

Are there hidden encoding errors that I need to fix in Latin 1 --> UTF-8?

Do I still need to run a full latin1 to UTF 8 conversion on the text that looks completely fine?
I'm swapping forum software, and the old forum database used Latin1 encoding. The new forum database uses UTF8 encoding for tables.
It looks like the importer script did a straight copy from one table to another without trying to fix any encoding issues.
I've been manually fixing the visible errors using a find-and-replace based on the conversion info listed here: http://www.i18nqa.com/debug/utf8-debug.html
The rest of the text looks fine and is completely readable.
My limited understanding is that UTF-8 is backwards compatible with ASCII and Latin1 is mostly ASCII, so it's only the edge cases that are different and need to be updated.
So do I still need to run a full latin1 to UTF 8 conversion on the text that looks completely fine?
I'd rather not because I've changed some of the BB Code tags on a number of the fields after they were stored in UTF 8, so concerned that those updates would have stuck UTF8 characters in the middle of the Latin1 characters, and trying to do a full conversion on mixed character sets will just muck things up further.
Any characters from ISO 8859-1 (Latin 1) in the range 0x80..0xFF need to be recoded as 2 bytes in UTF-8. The first byte is 0xC2 for 0x80..0xBF; the first byte is 0xC3 for 0xC0..0xFF. The second byte is derived from the original value from Latin 1 by setting the two most significant bits to 1 and 0. For the characters 0x80..0xBF, the value of the second byte is unchanged from Latin 1. If you were using 8859-15, you may have a few more complex conversions (the Euro symbol is encoded differently from other Latin 1 characters).
There are tools aplenty to assist. iconv is one such.

How to set unicode characters to database

I am working on twitter API in java I want to save search tweets in mysql database,I have changed default encoding type of table to utf-8 and collate to utf8_unicode_ci,also for column for which I am getting unicode values I have set default encoding type of to utf-8
and collate to utf8_unicode_ci. But stiil I am gettin data truncated for column,my data is not saved properly.
Please help me out.
Thanks in advance
Try to set the Connection Character Sets and Collations too using:
SET NAMES 'charset_name' [COLLATE 'collation_name']
and
SET CHARACTER SET charset_name
This post is quite old but since I was looking into the same issue today I stumbled into your question.
Since twitter supports emoticons aka Emoji you will have to switch to utf8mb4 instead of utf8. In a nutshell turns out MySQL’s utf8 charset only partially implements proper UTF-8 encoding. It can only store UTF-8-encoded symbols that consist of one to three bytes; encoded symbols that take up four bytes aren’t supported!
Since astral symbols (whose code points range from U+010000 to U+10FFFF) each consist of four bytes in UTF-8, you cannot store them using MySQL’s utf8 implementation.
Here is a link to a tutorial discussing the matter and detaily explains how to do the conversion to utf8mb4.

what encoding stackoverflow used in mysql?

I can not save the character 𝑴 in my mysql which encoding is utf8, but i found stackoverflow can save it and display it.
I made a mistake. stackoverflow also can not save 𝑴 .
If you can't store the character, you are encoding or decoding it incorrectly, or converting it to a character set that doesn't support the character.
The UTF-8 encoding can handle almost any character that exists in any language, so it's quite unlikely that it's a limitation of that encoding.
You have to use the Unicode character set or some Unicode encoding (UTF-7, UTF-8, UTF-16, UTF-32) for all steps of the process. If you convert the text to some other character set and then back, you can only support the characters of that specific character set.
Stackoverflow is trying to display the character as &#119924. So maybe that character value is being saved in the database (certainly some character value is being saved in the database), but the reason why we can't see that character is because of the font which is being used to display the HTML: perhaps it's the font being used, not the database, that doesn't support that character value.