How to set unicode characters to database - mysql

I am working with the Twitter API in Java and want to save search tweets in a MySQL database. I have changed the table's default character set to utf8 and its collation to utf8_unicode_ci, and I have set the same character set and collation on the column that receives the Unicode values.
But I am still getting a "data truncated for column" error, and my data is not saved properly.
Please help me out.
Thanks in advance

Try setting the connection character set and collation too, using:
SET NAMES 'charset_name' [COLLATE 'collation_name']
and
SET CHARACTER SET charset_name

This post is quite old, but I was looking into the same issue today and stumbled on your question.
Since Twitter supports emoji, you will have to switch to utf8mb4 instead of utf8. In a nutshell, it turns out that MySQL's utf8 charset only partially implements proper UTF-8 encoding. It can only store UTF-8-encoded symbols that consist of one to three bytes; encoded symbols that take up four bytes aren't supported!
Since astral symbols (whose code points range from U+010000 to U+10FFFF) each consist of four bytes in UTF-8, you cannot store them using MySQL's utf8 implementation.
Here is a link to a tutorial that discusses the matter and explains in detail how to do the conversion to utf8mb4.
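As a rough way to check whether a given tweet is affected, the sketch below tests for characters above U+FFFF, which are the ones that need four bytes in UTF-8. The function names and the sample strings are made up for illustration and are not part of any Twitter or MySQL API.

```python
def needs_utf8mb4(text: str) -> bool:
    """Return True if any character lies outside the Basic Multilingual Plane."""
    return any(ord(ch) > 0xFFFF for ch in text)


def max_utf8_bytes_per_char(text: str) -> int:
    """Return the largest number of UTF-8 bytes used by a single character."""
    return max((len(ch.encode("utf-8")) for ch in text), default=0)


# Hypothetical sample tweets, written with escapes so the source stays ASCII.
plain_tweet = "Caf\u00e9 \u2714"     # e-acute (2 bytes) and a check mark (3 bytes)
emoji_tweet = "Payday \U0001F4B2"    # U+1F4B2, an astral symbol (4 bytes)

print(needs_utf8mb4(plain_tweet), max_utf8_bytes_per_char(plain_tweet))  # False 3
print(needs_utf8mb4(emoji_tweet), max_utf8_bytes_per_char(emoji_tweet))  # True 4
```

A tweet for which needs_utf8mb4 returns True would not fit in a column that uses MySQL's 3-byte utf8.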

Related

Some unicode icons are not saved and displayed correct

I am trying to add Unicode icons to my website's SEO title/meta, and for some reason it will not accept some icons. My site is UTF-8. I am saving the text in my database with the utf8_general_ci collation.
When I add the icon 💲, it comes back as ????
https://emojipedia.org/heavy-dollar-sign/
When I add the icon ✔️, it shows the ✔️ in the title.
https://emojipedia.org/check-mark/
Is there a reason for this, or is this normal?
"✔️" is a 3-byte Emoji; "💲" takes 4 bytes. So the problem is that CHARACTER SET utf8 needs to be changed to CHARACTER SET utf8mb4.
The solution is to do either of the following:
- Provide utf8mb4 in the connection parameters. How to do this varies with the client (Java/PHP/...).
- Add SET NAMES utf8mb4 to the code right after connecting.
If these are not specific enough, search for places where you have utf8 that could be utf8mb4. (Note: I am not saying UTF-8, which is the non-MySQL equivalent of utf8mb4.)
More discussion: Trouble with UTF-8 characters; what I see is not what I stored
Technically the checkmark takes 6 bytes -- two 3-byte characters: hex E29C94 EFB88F. So it worked fine with utf8. However the dollar sign needs 4 bytes: F09F92B2, so it could not be represented in utf8, only in utf8mb4. The failure was shown via 4 question marks.
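The byte counts above can be checked with a few lines of Python. This is only a verification sketch; the helper name utf8_hex_per_char is made up for this example.

```python
def utf8_hex_per_char(text: str) -> list[str]:
    """Return the UTF-8 bytes of each character as an upper-case hex string."""
    return [ch.encode("utf-8").hex().upper() for ch in text]


check_mark = "\u2714\ufe0f"   # check mark followed by VARIATION SELECTOR-16
dollar = "\U0001F4B2"         # HEAVY DOLLAR SIGN

print(utf8_hex_per_char(check_mark))  # ['E29C94', 'EFB88F']
print(utf8_hex_per_char(dollar))      # ['F09F92B2']
```

Each element of the first list is three bytes long, so both characters fit in utf8; the single element of the second list is four bytes long and needs utf8mb4.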

Excel CSV String Not Fully Uploading To Excel

I have this string in Excel (I have UTF-encoded it). When I save it as CSV and import it into MySQL, I get only the text shown below. I know it is probably a charset issue, but could you explain why? I am having difficulty understanding it.
In Excel Cell:
PARTY HARD PAYDAY SPECIAL â UPTO £40 OFF EVENT PACKAGES INCLUDING HOTTEST EVENTS! MUST END SUNDAY! http://bit.ly/1Gzrw9H
Ends up in DB:
PARTY HARD PAYDAY SPECIAL
The field is defined as VARCHAR(10000) with the utf8_general_ci collation.
MySQL's utf8 character set does not support full UTF-8. Some characters take 4 bytes and cannot be processed or, I guess, stored properly in regular utf8. I am assuming that the import truncates the value after SPECIAL because MySQL does not know how to process or store the character that comes next in the string.
To handle full UTF-8, including 4-byte characters, you will have to switch over to utf8mb4.
This is from the mysql documentation:
The character set named utf8 uses a maximum of three bytes per character and contains only BMP characters. The utf8mb4 character set uses a maximum of four bytes per character supports supplemental characters...
You can read more in the MySQL documentation (dev.mysql).
Also, here is a great, detailed explanation of the regular utf8 issues in MySQL and how to switch to utf8mb4.

UTF-8 is an Encoding or a Document Character Set?

According to a W3C Recommendation, every application requires a document character set (not to be confused with a character encoding).
A document character set consists of:
A Repertoire: A set of abstract characters, such as the Latin letter "A", the Cyrillic letter "I", the Chinese character meaning "water", etc.
Code positions: A set of integer references to characters in the repertoire.
Each document is a sequence of characters from the repertoire.
Character Encoding is:
How those characters may be represented
When I save a file in Windows Notepad, I am guessing that these are the "document character sets":
ANSI
UNICODE
UNICODE BIG ENDIAN
UTF-8
Three simple questions:
I want to know if those are the "document character sets". And if they are,
why is UTF-8 on the list? Isn't UTF-8 supposed to be an encoding?
If I am not wrong about all this:
Are there other document character sets that Windows does not allow you to define?
How can I define another document character set?
In my understanding:
ANSI is both a character set and an encoding of that character set.
Unicode is a character set; the encoding in question is probably UTF-16. An alternative encoding of the same character set is big-endian UTF-16, which is probably what the third option is referring to.
UTF-8 is an encoding of Unicode.
The purpose of that dropdown in the Save dialog is really to select both a character set and an encoding for it, but they've been a little careless with the naming of the options.
(Technically, though, an encoding just maps integers to byte sequences, so any encoding could be used with any character set that is small enough to "fit" the encoding. However, the UTF-* encodings are designed with Unicode in mind.)
Also, see Joel on Software's mandatory article on the subject.
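To make the character set versus encoding distinction concrete, here is a small sketch that encodes the same two characters in four ways. The mapping to Notepad's menu entries is an assumption: "ANSI" is taken to mean Windows-1252 (it really depends on the system code page) and "Unicode" is taken to mean little-endian UTF-16. The helper name encode_all is made up for this example.

```python
def encode_all(text, encodings=("utf-8", "utf-16-le", "utf-16-be", "cp1252")):
    """Map each encoding name to the hex bytes it produces for text."""
    return {name: text.encode(name).hex(" ") for name in encodings}


sample = "A\u20ac"  # the letter A and the euro sign (U+20AC)

for name, hex_bytes in encode_all(sample).items():
    print(f"{name:10} {hex_bytes}")
# utf-8      41 e2 82 ac
# utf-16-le  41 00 ac 20
# utf-16-be  00 41 20 ac
# cp1252     41 80
```

The code points are the same in every row; only the byte sequences differ. These Python codecs do not write a byte order mark, which an editor may add at the start of the file.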
UTF-8 is a character encoding that is also used to specify a character set for HTML and other textual documents. It is one of several Unicode encodings (UTF-16 is another).
To answer your questions:
It is on the list because Microsoft decided to implement it in Notepad.
There are many other character sets, though defining your own is not useful, so it is not really possible.
You can't define other character sets to save with Notepad. Try using a programmer's editor such as Notepad++, which will give you more character sets to use.

MySQL collation to store multilingual data of unknown language

I am new to multilingual data, and I confess that I have never tried it before.
Currently I am working on a multilingual site, but I do not know which languages will be used.
Which collation/character set of MySQL should I use to achieve this?
Should I use some Unicode type of character set?
And of course these languages are not out of this universe, these must be in the set which we mostly use.
You should use a Unicode collation. You can set it as the default on your system, or on each field of your tables. The following Unicode collations are available, and these are their differences:
utf8_general_ci is a very simple collation. It just
- removes all accents
- then converts to upper case
and uses the code of the resulting "base letter" to compare.
utf8_unicode_ci uses the default Unicode collation element table.
The main differences are:
- utf8_unicode_ci supports so-called expansions and ligatures. For example, the German letter ß (U+00DF LATIN SMALL LETTER SHARP S) is sorted near "ss", and the letter Œ (U+0152 LATIN CAPITAL LIGATURE OE) is sorted near "OE".
- utf8_general_ci does not support expansions or ligatures; it sorts all these letters as single characters, and sometimes in the wrong order.
- utf8_unicode_ci is generally more accurate for all scripts. For example, in the Cyrillic block, utf8_unicode_ci is fine for Russian, Bulgarian, Belarusian, Macedonian, Serbian, and Ukrainian, while utf8_general_ci is fine only for the Russian and Bulgarian subset of Cyrillic. Extra letters used in Belarusian, Macedonian, Serbian, and Ukrainian are not sorted well.
- The disadvantage of utf8_unicode_ci is that it is a little bit slower than utf8_general_ci.
So unless you know exactly which languages and characters you are going to use, I recommend utf8_unicode_ci, which has broader coverage.
Extracted from MySQL forums.
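The "remove accents, then upper-case" idea described for utf8_general_ci can be imitated in a few lines of Python. This is only an illustration of the idea and not MySQL's implementation; the function name base_letter_key is made up for this example.

```python
import unicodedata


def base_letter_key(text: str) -> str:
    """Build a comparison key by dropping accents and converting to upper case."""
    decomposed = unicodedata.normalize("NFD", text)  # split letters from accents
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.upper()


# Accent and case differences disappear in the key.
print(base_letter_key("r\u00e9sum\u00e9") == base_letter_key("RESUME"))  # True

# Sorting by the key puts the accented capital between "apple" and "zebra".
words = ["zebra", "\u00c9clair", "apple"]
print(sorted(words, key=base_letter_key))
```

Do not use this to predict MySQL's ordering. For instance, Python's upper() turns ß into "SS", which is an expansion of the kind that utf8_general_ci is described above as not supporting.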
UTF-8 encompasses most languages, so that is your safest bet. However, there are exceptions, and you need to make sure all the languages you want to cover work in UTF-8. My experience with storing character sets that MySQL doesn't understand is that it will not be able to sort properly, but the data has remained intact as long as I read it out in the same character encoding I wrote it in.
UTF-8 is the character encoding, a way of storing a number. Which character is represented by which number is Unicode, an important distinction. Unicode covers a large number of languages and UTF-8 can encode them all (0 to 10FFFF, sort of). Java's internal representation is a 16-bit char, so a code point above U+FFFF takes two chars there, a surrogate pair (not that you care about Java :).
You can insert text in any language into a MySQL table by changing the collation of the table field to 'utf8_general_ci'. It is case-insensitive.

what encoding stackoverflow used in mysql?

I cannot save the character 𝑴 in my MySQL database, whose encoding is utf8, but I found that Stack Overflow can save and display it.
I made a mistake. Stack Overflow cannot save 𝑴 either.
If you can't store the character, you are encoding or decoding it incorrectly, or converting it to a character set that doesn't support the character.
The UTF-8 encoding can handle almost any character that exists in any language, so it's quite unlikely that it's a limitation of that encoding.
You have to use the Unicode character set or some Unicode encoding (UTF-7, UTF-8, UTF-16, UTF-32) for all steps of the process. If you convert the text to some other character set and then back, you can only support the characters of that specific character set.
Stackoverflow is trying to display the character as &#119924. So maybe that character value is being saved in the database (certainly some character value is being saved in the database), but the reason why we can't see that character is because of the font which is being used to display the HTML: perhaps it's the font being used, not the database, that doesn't support that character value.
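The number in that character reference is simply the decimal code point of the character, which a short Python sketch can confirm. The variable names are made up for this example.

```python
import html

ch = "\U0001D474"  # MATHEMATICAL BOLD ITALIC CAPITAL M

code_point = ord(ch)
reference = f"&#{code_point};"          # HTML numeric character reference

print(code_point)                        # 119924
print(hex(code_point))                   # 0x1d474
print(reference)                         # &#119924;
print(html.unescape(reference) == ch)    # True
print(ch.encode("utf-8").hex().upper())  # F09D91B4
```

The code point is above U+FFFF and its UTF-8 form is four bytes long, so whether it can be stored at all depends on every step handling characters outside the Basic Multilingual Plane.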