MySQL collation to store multilingual data of unknown language

I am new to multilingual data and I confess I have never worked with it before.
Currently I am working on a multilingual site, but I do not know which languages will be used.
Which MySQL collation/character set should I use to achieve this?
Should I use some Unicode character set?
And of course these languages are nothing exotic; they will be among the ones we mostly use.

You should use a Unicode collation. You can set it as the default on your system, or on each field of your tables. These are the Unicode collation names and their differences:
utf8_general_ci is a very simple collation. It just
- removes all accents
- then converts to upper case
and uses the code of the resulting "base letter" to compare.
utf8_unicode_ci uses the default Unicode collation element table.
The main differences are:
utf8_unicode_ci supports so-called expansions and ligatures. For example, the German letter ß (U+00DF LATIN SMALL LETTER SHARP S) is sorted near "ss", and the letter Œ (U+0152 LATIN CAPITAL LIGATURE OE) is sorted near "OE".
utf8_general_ci does not support expansions/ligatures; it sorts all these letters as single characters, and sometimes in the wrong order.
utf8_unicode_ci is generally more accurate for all scripts. For example, in the Cyrillic block, utf8_unicode_ci is fine for all of these languages: Russian, Bulgarian, Belarusian, Macedonian, Serbian, and Ukrainian, while utf8_general_ci is fine only for the Russian and Bulgarian subset of Cyrillic. The extra letters used in Belarusian, Macedonian, Serbian, and Ukrainian are not sorted well.
The disadvantage of utf8_unicode_ci is that it is a little slower than utf8_general_ci.
So, unless you know exactly which languages/characters you are going to store, I recommend utf8_unicode_ci, which has broader coverage.
Extracted from MySQL forums.
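The expansion difference is easy to see directly in MySQL (a minimal sketch; it assumes a utf8 connection):
SET NAMES utf8;
-- utf8_unicode_ci supports expansions, so ß compares equal to "ss":
SELECT 'ß' = 'ss' COLLATE utf8_unicode_ci; -- returns 1
-- utf8_general_ci has no expansions (it treats ß like a single base letter):
SELECT 'ß' = 'ss' COLLATE utf8_general_ci; -- returns 0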

UTF-8 encompasses most languages, so that's your safest bet. However, there are exceptions, and you need to make sure all the languages you want to cover work in UTF-8. My experience with storing character sets MySQL doesn't understand is that it will not be able to sort properly, but the data has remained intact as long as I read it out in the same character encoding I wrote it in.
UTF-8 is the character encoding, i.e. a way of storing a number. Which character is represented by which number is defined by Unicode, an important distinction. Unicode covers a large number of languages, and UTF-8 can encode them all (code points 0 to 10FFFF, more or less), but Java can't handle all of them, since the VM's internal representation is a 16-bit character (not that you care about Java :).

You can insert text in any language into a MySQL table by changing the collation of the table field to utf8_general_ci. It is case-insensitive.
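For example (a sketch with hypothetical table/column names):
-- Change one column to a Unicode character set and collation:
ALTER TABLE posts MODIFY body TEXT CHARACTER SET utf8 COLLATE utf8_general_ci;
-- Or convert the whole table, including its existing string columns:
ALTER TABLE posts CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;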

Related

Proper way to store BCrypt Hashes on MySQL

Searching for the proper way to store BCrypt hashes in MySQL, I found this question, and it only made me more confused.
The accepted answer points out that we should use:
CHAR(60) BINARY or BINARY(60)
But other people on the comments argue that instead we should use:
CHAR(60) CHARACTER SET latin1 COLLATE latin1_bin
or even:
COLLATE latin1_general_cs
I am not a specialist in databases, so could anyone explain the difference between all these options and which one is truly better for storing BCrypt hashes?
My answer is along the lines of "what is proper", rather than "what will work".
Do not use latin1. Sure, it might work, but it is ugly to claim that the encrypted string is text when it is not.
Ditto for saying CHAR....
Simply say BINARY(...) if fixed length or VARBINARY(...) if it can vary in length.
However, there is a gotcha... Whose BCrypt are you using? Does it return binary data? Or a hex string? Or maybe even Base64?
My above answer assumed it returns binary data.
If it returns 60 hex digits, then store UNHEX(60_hex_digits) into BINARY(30) so that it is packed smaller.
If it is Base64, then CHARACTER SET ascii COLLATE ascii_bin would be "proper". (latin1 with a case-sensitive collation would also work.)
If it is binary, then, again, BINARY(60) is the 'proper' way to do it.
The link you provided looks like Base64, but is it? And is it up to 60 characters? Then I would use
VARCHAR(60) CHARACTER SET ascii COLLATE ascii_bin
And explicitly state the charset/collation for the column, thereby overriding the database and/or table "defaults".
All the Base64 chars (and $) are ASCII; no need for a more complex charset. Collating with a ..._bin collation means "compare bytes exactly", or more specifically, "don't do case folding". Since Base64 depends on distinguishing between upper-case and lower-case letters, you don't want case folding.
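A minimal sketch of that recommendation (the table and column names are hypothetical):
CREATE TABLE users (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    -- bcrypt output such as '$2y$10$...' is up to 60 ASCII characters;
    -- ascii_bin compares bytes exactly, so no case folding is done:
    password_hash VARCHAR(60) CHARACTER SET ascii COLLATE ascii_bin NOT NULL
);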

Excel CSV String Not Fully Uploading To MySQL

I have this string in Excel (I've UTF-encoded it). When I save as CSV and import into MySQL, I get only the result below. I know it's probably a charset issue, but could you explain why, as I'm having difficulty understanding it?
In Excel Cell:
PARTY HARD PAYDAY SPECIAL â UPTO £40 OFF EVENT PACKAGES INCLUDING HOTTEST EVENTS! MUST END SUNDAY! http://bit.ly/1Gzrw9H
Ends up in DB:
PARTY HARD PAYDAY SPECIAL
The field is defined as VARCHAR(10000) with the utf8_general_ci collation.
MySQL's utf8 does not support full Unicode. There are some 4-byte characters that cannot be processed and, I guess, stored properly in regular utf8. I am assuming that upon import it truncates the value after SPECIAL, since MySQL does not know how to process or store the character that comes after it in the string.
In order to handle full UTF-8, including 4-byte characters, you will have to switch over to utf8mb4.
This is from the mysql documentation:
The character set named utf8 uses a maximum of three bytes per character and contains only BMP characters. The utf8mb4 character set uses a maximum of four bytes per character and supports supplementary characters...
You can read more in the MySQL documentation at dev.mysql.com.
Also, here is a great detailed explanation of the utf8 issues in MySQL and how to switch to utf8mb4.
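A minimal conversion sketch (hypothetical table name; check column and index lengths before running this on real data):
-- Convert the table and all of its string columns to full 4-byte UTF-8:
ALTER TABLE messages CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
-- The connection must use the same charset, or data is still mangled in transit:
SET NAMES utf8mb4;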

Are there hidden encoding errors that I need to fix in Latin 1 --> UTF-8?

I'm swapping forum software, and the old forum database used Latin1 encoding. The new forum database uses UTF8 encoding for tables.
It looks like the importer script did a straight copy from one table to another without trying to fix any encoding issues.
I've been manually fixing the visible errors using a find-and-replace based on the conversion info listed here: http://www.i18nqa.com/debug/utf8-debug.html
The rest of the text looks fine and is completely readable.
My limited understanding is that UTF-8 is backwards compatible with ASCII and Latin1 is mostly ASCII, so it's only the edge cases that are different and need to be updated.
So do I still need to run a full latin1 to UTF 8 conversion on the text that looks completely fine?
I'd rather not, because I've changed some of the BB Code tags in a number of the fields after they were stored in UTF-8, so I'm concerned that those updates have put UTF-8 characters in the middle of the Latin1 characters, and trying to do a full conversion on mixed character sets will just muck things up further.
Any characters from ISO 8859-1 (Latin 1) in the range 0x80..0xFF need to be recoded as 2 bytes in UTF-8. The first byte is 0xC2 for 0x80..0xBF; the first byte is 0xC3 for 0xC0..0xFF. The second byte is derived from the original value from Latin 1 by setting the two most significant bits to 1 and 0. For the characters 0x80..0xBF, the value of the second byte is unchanged from Latin 1. If you were using 8859-15, you may have a few more complex conversions (the Euro symbol is encoded differently from other Latin 1 characters).
There are tools aplenty to assist. iconv is one such.
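You can also watch the recoding happen from MySQL itself (a small sketch using CONVERT and hex literals; é is 0xE9 and £ is 0xA3 in Latin 1):
-- Latin 1 byte E9 (é) is in 0xC0..0xFF, so it becomes C3 followed by E9 with its top two bits set to 10 (A9):
SELECT HEX(CONVERT(_latin1 X'E9' USING utf8)); -- 'C3A9'
-- A byte in 0x80..0xBF keeps its value as the second byte, prefixed with C2:
SELECT HEX(CONVERT(_latin1 X'A3' USING utf8)); -- 'C2A3'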

How to save Unicode characters to a database

I am working with the Twitter API in Java and I want to save search tweets in a MySQL database. I have changed the default encoding of the table to utf-8 and the collation to utf8_unicode_ci; for the column in which I am getting Unicode values, I have also set the default encoding to utf-8 and the collation to utf8_unicode_ci. But I am still getting "data truncated for column"; my data is not saved properly.
Please help me out.
Thanks in advance.
Try setting the connection character set and collation too, using:
SET NAMES 'charset_name' [COLLATE 'collation_name']
and
SET CHARACTER SET charset_name
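For example, to match the table settings described in the question (a minimal sketch; see the next answer for why utf8mb4 may actually be required here):
SET NAMES 'utf8' COLLATE 'utf8_unicode_ci';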
This post is quite old, but since I was looking into the same issue today, I stumbled onto your question.
Since Twitter supports emoticons, aka Emoji, you will have to switch to utf8mb4 instead of utf8. In a nutshell, it turns out MySQL's utf8 charset only partially implements proper UTF-8 encoding: it can only store UTF-8-encoded symbols that consist of one to three bytes; encoded symbols that take up four bytes aren't supported!
Since astral symbols (whose code points range from U+010000 to U+10FFFF) each consist of four bytes in UTF-8, you cannot store them using MySQL’s utf8 implementation.
Here is a link to a tutorial discussing the matter, which explains in detail how to do the conversion to utf8mb4.
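A condensed version of that conversion (hypothetical database/table/column names; existing index lengths may need attention):
ALTER DATABASE twitter_db CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE tweets MODIFY tweet_text TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
SET NAMES utf8mb4;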

Is UTF-8 an Encoding or a Document Character Set?

According to the W3C Recommendation, every application requires a document character set (not to be confused with a character encoding).
A document character set consists of:
A Repertoire: A set of abstract characters, such as the Latin letter "A", the Cyrillic letter "I", the Chinese character meaning "water", etc.
Code positions: A set of integer references to characters in the repertoire.
Each document is a sequence of characters from the repertoire.
Character Encoding is:
How those characters may be represented
When I save a file in Windows Notepad, I'm guessing that these are the "document character sets":
ANSI
UNICODE
UNICODE BIG ENDIAN
UTF-8
Three simple questions:
I want to know if those are the "document character sets". And if they are:
Why is UTF-8 on the list? Isn't UTF-8 supposed to be an encoding?
If I'm not wrong about all this stuff:
Are there other document character sets that Windows does not let you choose?
How do you define another document character set?
In my understanding:
ANSI is both a character set and an encoding of that character set.
Unicode is a character set; the encoding in question is probably UTF-16. An alternative encoding of the same character set is big-endian UTF-16, which is probably what the third option is referring to.
UTF-8 is an encoding of Unicode.
The purpose of that dropdown in the Save dialog is really to select both a character set and an encoding for it, but they've been a little careless with the naming of the options.
(Technically, though, an encoding just maps integers to byte sequences, so any encoding could be used with any character set that is small enough to "fit" the encoding. However, the UTF-* encodings are designed with Unicode in mind.)
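To make the character-set/encoding distinction concrete, here is a small MySQL-flavoured sketch (matching the rest of this page; it assumes a connection charset that can send €):
SET NAMES utf8mb4;
-- One Unicode character (U+20AC), different byte sequences per encoding:
SELECT HEX(CONVERT('€' USING utf8mb4)); -- 'E282AC' (UTF-8)
SELECT HEX(CONVERT('€' USING utf16));   -- '20AC' (UTF-16, big-endian)
SELECT HEX(CONVERT('€' USING utf16le)); -- 'AC20' (UTF-16, little-endian)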
Also, see Joel on Software's mandatory article on the subject.
UTF-8 is a character encoding that is also used to specify a character set for HTML and other textual documents. It is one of several Unicode encodings (UTF-16 is another).
To answer your questions:
It is on the list because Microsoft decided to implement it in Notepad.
There are many other character sets, though defining your own is not useful, so it's not really possible.
You can't choose other character sets to save with Notepad. Try using a programmer's editor such as Notepad++, which will give you more character sets to use.