Searching for the proper way to store BCrypt hashes in MySQL, I found this question, and it only made me more confused.
The accepted answer points out that we should use:
CHAR(60) BINARY or BINARY(60)
But other people in the comments argue that we should instead use:
CHAR(60) CHARACTER SET latin1 COLLATE latin1_bin
or even:
COLLATE latin1_general_cs
I am not a database specialist, so could anyone explain the difference between all these options and which one is truly better for storing BCrypt hashes?
My answer is in the line of "what is proper", rather than "what will work".
Do not use latin1. Sure, it might work, but it is ugly to claim that the hashed string is text when it is not.
Ditto for saying CHAR....
Simply say BINARY(...) if fixed length or VARBINARY(...) if it can vary in length.
However, there is a gotcha... Whose BCrypt are you using? Does it return binary data? Or a hex string? Or maybe even Base64?
My above answer assumed it returns binary data.
If it returns 60 hex digits, then store UNHEX(60_hex_digits) into BINARY(30) so that it is packed smaller.
If it is Base64, then CHARACTER SET ascii COLLATE ascii_bin would be "proper". (latin1 with a case-sensitive collation would also work.)
If it is binary, then, again, BINARY(60) is the 'proper' way to do it.
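For the hex case, here is a rough Python sketch of what packing 60 hex digits into 30 bytes amounts to. The hex value is a made-up placeholder, not the output of any real BCrypt library, and `bytes.fromhex` only stands in for what UNHEX() does on the MySQL side.

```python
# Hypothetical 60-hex-digit value standing in for a library's hex output.
hex_digits = "0123456789abcdef" * 3 + "0123456789ab"  # 48 + 12 = 60 digits

# Two hex digits become one byte, the same idea as MySQL's UNHEX().
packed = bytes.fromhex(hex_digits)
print(len(hex_digits), "hex digits ->", len(packed), "bytes")

# Going back to hex on the way out, the same idea as HEX().
assert packed.hex() == hex_digits
```

So a BINARY(30) column is enough for that variant, at the cost of converting on every read and write.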
The link you provided looks like Base64, but is it? And is it up to 60 characters? Then I would use
VARCHAR(60) CHARACTER SET ascii COLLATE ascii_bin
And explicitly state the charset/collation for the column, thereby overriding the database and/or table "defaults".
All the Base64 chars (and $) are ascii; no need for a more complex charset. Collating with a ..._bin means "compare bytes exactly"; more specifically "don't do case folding". Since Base64 depends on distinguishing between upper and lower case letters, you don't want case folding.
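To make the ascii and ..._bin points concrete, here is a small Python sketch. The sample string is only an illustration laid out in the usual `$2a$<cost>$<salt><hash>` form; check what your own library actually returns.

```python
# Illustrative string in the common bcrypt layout; not tied to any real password.
sample = "$2a$12$R9h/cIPz0gi.URNNX3kh2OPST9/PgBkqquzi.Ss7KIUgO2t0jWMUW"

# Encoding as ASCII raises UnicodeEncodeError if any character is non-ASCII,
# so success here means the ascii charset can hold the value.
raw = sample.encode("ascii")
print(len(sample), "characters,", len(raw), "bytes")

# A case-insensitive comparison would treat two different hashes as equal,
# which is why a byte-exact (_bin) collation is the safer choice.
other = sample.swapcase()
assert other != sample                    # different strings
assert other.lower() == sample.lower()    # equal once case is folded
```

With a ..._ci collation the two values above would compare as equal; with ascii_bin they would not.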
Related
Recently I exported parts of my MySQL database, and noticed that the text had several strange characters in it. For example, the string â€™ often appeared.
When trying to find out what this meant, I found the Stack Overflow question: Character Encoding and the â€™ Issue. From that question I now know that the string â€™ stands for a quote.
But how can I find out more generally what a string of characters stands for? For example, the letter Â often appears in my database as well, and is actually causing me a problem now on a certain page. To solve the problem, I would like to know what that character means.
I've looked at several tables showing character encodings, but haven't been able to figure out how to use these tables to see why â€™ means ’, or, more importantly for me, what Â stands for. I'd be very grateful if someone could point me in the right direction.
The latin1 encoding for â€™ is (in hex) E28099.
The utf8 encoding for ’ is E28099.
But you pasted in C3A2E282ACE284A2, which is the "double encoding" of that apostrophe.
What apparently happened is that you had ’ in the client; the client was generating utf8 encodings. But your connection parameters to MySQL said "latin1". So, your INSERT statement dutifully treated it as 3 latin1 characters E2 80 99 (visually â€™), and converted each one to utf8, hex C3A2 E282AC E284A2.
Read about "double encoding" in Trouble with UTF-8 characters; what I see is not what I stored
Meanwhile, browsers tend to be forgiving about double-encoding, or else it might have shown â€™
latin1 characters are each 1 byte (2 hex digits). utf8/utf8mb4 characters are 1-to-4 bytes; some 2-byte and 3-byte encodings showed up in your exercise.
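The sequence of steps can be replayed outside MySQL. This is only a sketch: Python's cp1252 codec is used here as a stand-in for MySQL's latin1, which is an assumption of the example (the two are close but not identical).

```python
apostrophe = "\u2019"  # RIGHT SINGLE QUOTATION MARK, the curly apostrophe

# Step 1: the client sends utf8 bytes.
utf8_bytes = apostrophe.encode("utf-8")

# Step 2: the connection claims latin1, so the 3 bytes are read as 3 characters.
# cp1252 is a stand-in for MySQL's latin1 here (assumption of this sketch).
misread = utf8_bytes.decode("cp1252")

# Step 3: each of those 3 characters is converted to utf8 for storage.
double_encoded = misread.encode("utf-8")

print(utf8_bytes.hex().upper())
print([f"U+{ord(ch):04X}" for ch in misread])
print(double_encoded.hex().upper())
```

The last line prints the same hex that was pasted in the question, which is how double encoding can be recognized.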
As for Â... Go to http://mysql.rjweb.org/doc.php/charcoll#8_bit_encodings and look at the second table there. Notice how the first two columns have lots of things starting with Â. In latin1, that is hex C2. In utf8, many punctuation marks are encoded as 2 bytes: C2xx. For example, the copyright symbol, ©, is utf8 hex C2A9, which is misinterpreted as Â©.
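The copyright example is easy to reproduce, and the same two calls run in reverse undo a single misread. The helper name `undo_misread` is made up for this sketch, and it only handles one layer of utf8-read-as-latin1.

```python
copyright_sign = "\u00a9"                 # COPYRIGHT SIGN

stored = copyright_sign.encode("utf-8")   # the two bytes C2 A9
shown = stored.decode("latin-1")          # read back as two latin1 characters
assert shown == "\u00c2\u00a9"            # A-circumflex followed by the copyright sign


def undo_misread(text):
    """Reverse one utf8-read-as-latin1 step (hypothetical helper)."""
    return text.encode("latin-1").decode("utf-8")


assert undo_misread(shown) == copyright_sign
```

A stray Â in front of punctuation is therefore usually a sign that utf8 bytes were displayed as latin1.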
I have this string in Excel (I've UTF-encoded it). When I save it as CSV and import it into MySQL, I get only the text shown below. I know it's probably a charset issue, but could you explain why? I'm having difficulty understanding it.
In Excel Cell:
PARTY HARD PAYDAY SPECIAL â UPTO £40 OFF EVENT PACKAGES INCLUDING HOTTEST EVENTS! MUST END SUNDAY! http://bit.ly/1Gzrw9H
Ends up in DB:
PARTY HARD PAYDAY SPECIAL
The field is structured to be utf8_general_ci encoded and VARCHAR(10000)
MySQL's utf8 character set does not support full Unicode: 4-byte characters cannot be processed or, I guess, stored properly in regular utf8. I am assuming that upon import the value is truncated after SPECIAL because MySQL does not know how to process or store the character that comes next in the string.
In order to handle full UTF-8, including 4-byte characters, you will have to switch over to utf8mb4.
This is from the mysql documentation:
The character set named utf8 uses a maximum of three bytes per character and contains only BMP characters. The utf8mb4 character set uses a maximum of four bytes per character supports supplemental characters...
You can read more in the MySQL documentation (dev.mysql).
Also, here is a detailed explanation of the regular-utf8 issues in MySQL and how to switch to utf8mb4.
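If you want to see how many bytes a given character takes in UTF-8, a quick Python check is enough. The four sample characters are my own picks for illustration, not taken from the spreadsheet in the question.

```python
samples = {
    "A": "\u0041",           # LATIN CAPITAL LETTER A
    "pound": "\u00a3",       # POUND SIGN
    "en dash": "\u2013",     # EN DASH
    "emoji": "\U0001f600",   # GRINNING FACE, outside the BMP
}

# Number of bytes each character occupies once encoded as UTF-8.
byte_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}

for name, n in byte_lengths.items():
    print(f"{name}: {n} byte(s)")
```

Anything that comes out as 4 bytes needs utf8mb4; characters of 1 to 3 bytes fit in the older utf8 character set.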
I am working with the Twitter API in Java and want to save search tweets in a MySQL database. I have changed the table's default character set to utf8 and its collation to utf8_unicode_ci, and I have done the same for the column that receives the Unicode values. But I am still getting "data truncated" for the column, and my data is not saved properly.
Please help me out.
Thanks in advance
Try to set the Connection Character Sets and Collations too using:
SET NAMES 'charset_name' [COLLATE 'collation_name']
and
SET CHARACTER SET charset_name
This post is quite old but since I was looking into the same issue today I stumbled into your question.
Since Twitter supports emoticons, aka emoji, you will have to switch to utf8mb4 instead of utf8. In a nutshell, it turns out that MySQL’s utf8 charset only partially implements proper UTF-8 encoding. It can only store UTF-8-encoded symbols that consist of one to three bytes; encoded symbols that take up four bytes aren’t supported!
Since astral symbols (whose code points range from U+010000 to U+10FFFF) each consist of four bytes in UTF-8, you cannot store them using MySQL’s utf8 implementation.
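To find out whether a given tweet contains such symbols before inserting it, something along these lines would do. Both the function name `astral_characters` and the sample tweet are made up for illustration.

```python
def astral_characters(text):
    """List the code points above U+FFFF found in text (hypothetical helper)."""
    return [f"U+{ord(ch):06X}" for ch in text if ord(ch) > 0xFFFF]


# Made-up tweet: the party popper is astral, the pound sign is not.
tweet = "Party tonight \U0001f389 \u00a340 off"
print(astral_characters(tweet))
```

An empty list means the text fits in MySQL's three-byte utf8; anything else calls for utf8mb4 on the column and on the connection.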
Here is a link to a tutorial that discusses the matter and explains in detail how to do the conversion to utf8mb4.
What's the difference between binary(10) and char(10) character set binary?
And between varbinary(10) and varchar(10) character set binary?
Are they synonymous in all MySQL engines?
Is there any gotcha to watch out for?
There isn't a difference.
However, there is a caveat if you're storing a string.
If you only want to store a byte array or other binary data such as a stream or file then use the binary type as that is what they are meant for.
Quote from the MySQL manual:
The use of CHARACTER SET binary in the definition of a CHAR, VARCHAR,
or TEXT column causes the column to be treated as a binary data type.
For example, the following pairs of definitions are equivalent:
CHAR(10) CHARACTER SET binary
BINARY(10)
VARCHAR(10) CHARACTER SET binary
VARBINARY(10)
TEXT CHARACTER SET binary
BLOB
So, technically there is no difference.
However, when storing a string, it must be converted from a string to byte values using a character set. The decision is whether to do this yourself before the data reaches the MySQL server or to leave it up to MySQL. MySQL will do this by casting the string to BINARY.
If you want to store the text in another encoding, let's say you have a business requirement that says you must use 4 bytes per character (MySQL doesn't do this by default), you could then apply CHARACTER SET BINARY to a textual column and perform the character set encoding yourself.
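As a rough illustration of doing the encoding yourself, here is what a fixed 4-bytes-per-character encoding looks like on the application side. UTF-32 is my own choice for the example; the requirement above does not name a specific encoding.

```python
text = "h\u00e9llo"  # 5 characters, one of them non-ASCII

# UTF-32 (big-endian, no byte order mark) uses exactly 4 bytes per character.
payload = text.encode("utf-32-be")
print(len(text), "characters ->", len(payload), "bytes")

# The application has to decode with the same encoding when reading back;
# the server would only see opaque bytes.
assert payload.decode("utf-32-be") == text
```

The resulting bytes are what you would hand to the binary column, and the server will not validate or convert them.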
It is also worth reading The BINARY and VARBINARY Types from the MySQL manual as this details important information such as padding.
Summary:
There is no technical difference, as one is a synonym for the other. In my opinion it makes logical sense to store binary strings in data types that would normally hold a string, using CHARACTER SET BINARY, and to store byte arrays, streams, and other data that cannot be represented by transforming it through a character set in BINARY fields.
I am new to multilingual data, and I confess that I have never tried it before.
Currently I am working on a multilingual site, but I do not know which language will be used.
Which collation/character set of MySQL should I use to achieve this?
Should I use some Unicode type of character set?
And of course these languages are not out of this universe, these must be in the set which we mostly use.
You should use a Unicode collation. You can set it as the default on your system, or on each field of your tables. These are the Unicode collation names and their differences:
utf8_general_ci is a very simple collation. It just
- removes all accents
- then converts to upper case
and then compares the codes of the resulting "base letters".
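A very rough Python imitation of those two steps is shown here. It is not MySQL's actual weight table, only an approximation of the idea, and the function name `rough_general_ci_key` is made up.

```python
import unicodedata


def rough_general_ci_key(text):
    """Approximate comparison key: strip accents, then upper-case.

    This only imitates the idea described above; it is not the real
    utf8_general_ci implementation.
    """
    # Split accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category "Mn").
    no_accents = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return no_accents.upper()


# Two spellings that differ only in accents and case get the same key.
assert rough_general_ci_key("r\u00e9sum\u00e9") == rough_general_ci_key("RESUME")
```

Two strings with the same key would compare as equal under this simplified scheme.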
utf8_unicode_ci uses the default Unicode collation element table.
The main differences are:
utf8_unicode_ci supports so-called expansions and ligatures. For example, the German letter ß (U+00DF LETTER SHARP S) is sorted near "ss", and the letter Œ (U+0152 LATIN CAPITAL LIGATURE OE) is sorted near "OE".
utf8_general_ci does not support expansions/ligatures, it sorts all these letters as single characters, and sometimes in the wrong order.
utf8_unicode_ci is generally more accurate for all scripts. For example, in the Cyrillic block, utf8_unicode_ci is fine for all of these languages: Russian, Bulgarian, Belarusian, Macedonian, Serbian, and Ukrainian. utf8_general_ci, by contrast, is fine only for the Russian and Bulgarian subset of Cyrillic; the extra letters used in Belarusian, Macedonian, Serbian, and Ukrainian are not sorted well.
The disadvantage of utf8_unicode_ci is that it is a little bit slower than utf8_general_ci.
So, if you do not know in advance which specific languages or characters you are going to use, I recommend utf8_unicode_ci, which has broader coverage.
Extracted from MySQL forums.
UTF-8 encompasses most languages, so that's your safest bet. However, there are exceptions, and you need to make sure all languages you want to cover work in UTF-8. My experience with storing character sets MySQL doesn't understand is that it will not be able to sort properly, but the data has remained intact as long as I read it out in the same character encoding I wrote it in.
UTF-8 is the character encoding, a way of storing a number. Which character is represented by which number is Unicode - an important distinction. Unicode covers a large number of languages and UTF-8 can encode them all (0 to 10FFFF, sort of). Note that a Java char is only 16 bits, so Java has to represent code points above U+FFFF as a pair of chars (a surrogate pair) rather than a single char (not that you care about Java :).
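The number-versus-encoding distinction is easy to see in a few lines of Python. The two characters are arbitrary picks for the sketch.

```python
euro = "\u20ac"                  # EURO SIGN
emoji = "\U0001f600"             # GRINNING FACE, above U+FFFF

# Unicode: which number (code point) a character has.
assert ord(euro) == 0x20AC

# UTF-8: one way of storing that number as bytes.
assert euro.encode("utf-8").hex() == "e282ac"
assert len(emoji.encode("utf-8")) == 4

# UTF-16 stores the same code points as 16-bit units; anything above
# U+FFFF takes two of them (a surrogate pair).
utf16_units = len(emoji.encode("utf-16-be")) // 2
assert utf16_units == 2
```

The code point stays the same in every case; only the byte representation changes with the encoding.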
You can insert text in any language into a MySQL table by changing the collation of the table field to 'utf8_general_ci'. It is case insensitive.